New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed the node retry mechanism which could fail without trying all the connected nodes #6829
Fixed the node retry mechanism which could fail without trying all the connected nodes #6829
Conversation
if (e.unwrapCause() instanceof ConnectTransportException) { | ||
retryListener.onFailure(e); | ||
} else { | ||
if (!(e.unwrapCause() instanceof ConnectTransportException)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment about why we ignore this?
onFailure(e1); | ||
//no need to retry here, the transport service will notify this same listener | ||
//of the failure through the request holder, which will retry | ||
//ConnectTransportException gets thrown as well by TransportService due to throwConnectException option |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment can be clear, maybe:
"ConnectTransportException gets thrown as well by TransportService due to throwConnectException option, which is needed for the correct operation of execute(...). We can ignore it here because it will be passed through the listener interface"
import static org.hamcrest.Matchers.greaterThanOrEqualTo; | ||
|
||
@ClusterScope(scope = TEST, numClientNodes = 0) | ||
public class TransportClientRetryTests extends ElasticsearchIntegrationTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love to have a unit test of TransportClientNodesService, with a custom tranport service implementation that both throw error and checks that all nodes were tried...
The fix look good to me (left some comments regarding comments :) ). I'd love to see a unit test as opposed to an integration test. I think we'd get much more out of it. |
@@ -143,7 +143,7 @@ | |||
|
|||
static final boolean DEFAULT_ENABLE_RANDOM_BENCH_NODES = true; | |||
|
|||
private static final String NODE_MODE = nodeMode(); | |||
public static final String NODE_MODE = nodeMode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make nodeMode()
public instead?
…hout trying all the connected nodes The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed are not completed yet. The TransportService notifies of any exception received from a separate thread through the request holder, no need to notify the retry listener in any other place (either catch or onFailure method itself). Closes elastic#6829
…ception, which might get thrown before the transport service send request call only ConnectTransportException gets thrown by transport service (throwConnectException option), if another exception gets thrown it comes from something that happened before, we need to notify the original listener and stop the retries.
…te variants: with or without listener. The variant without listener throws exception on the calling thread (throwConnectException option).
…accurate in terms of number of expected nodes
…nsportService won't do it either do notify the listener in case of throwable, cause they can't come from the transport service and the listener needs to be notified of those, otherwise it bubbles up
Pushed new commits to address comments and a unit test for it as suggested by @bleskes . I also changed a bit how we catch exceptions given how they get thrown by the |
…lientNodesService and one a bit higher level that tests InternalTransportClient
…d exception handling in transport client retry mechanism
import java.util.concurrent.CopyOnWriteArraySet; | ||
import java.util.concurrent.atomic.AtomicInteger; | ||
|
||
public abstract class FailAndRetryMockTransport<Response extends TransportResponse> implements Transport { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move this as an internal class to the test that use it? also, throw illegal state exception on implemented methods that should not be called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, it is shared between two different tests, one more low level (TransportClientNodesServiceTests
) and one that tests the blocking version as well and involves the TransportActionNodeProxy
too (InternalTransportClientTests
). I'd make it package private if that's ok and move both tests to the same org.elasticsearch.client.transport
package?
LGTM, very clean now indeed! |
…re not supposed to be used
…hout trying all the connected nodes The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed were not completed yet. The TransportService used to throw ConnectTransportException due to throwConnectException set to true, and also notify the listener of any exception received from a separate thread through the request holder. Simplified exception handling by just removing the throwConnectException option from the TransportService, used only in the transport client. The transport client now relies solely on the request holder to notify of failures and eventually retry. Closes #6829
Side note: as part of the work to fix this issue the |
…hout trying all the connected nodes The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed were not completed yet. The TransportService used to throw ConnectTransportException due to throwConnectException set to true, and also notify the listener of any exception received from a separate thread through the request holder. Simplified exception handling by just removing the throwConnectException option from the TransportService, used only in the transport client. The transport client now relies solely on the request holder to notify of failures and eventually retry. Closes elastic#6829
…hout trying all the connected nodes The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed were not completed yet. The TransportService used to throw ConnectTransportException due to throwConnectException set to true, and also notify the listener of any exception received from a separate thread through the request holder. Simplified exception handling by just removing the throwConnectException option from the TransportService, used only in the transport client. The transport client now relies solely on the request holder to notify of failures and eventually retry. Closes elastic#6829
The RetryListener was notified twice for each single failure, which caused some additional retries, but more importantly was making the client reach the maximum number of retries (number of connected nodes) too quickly, meanwhile ongoing retries which could succeed are not completed yet.
The
TransportService
already notifies the listener of any exception received from a separate thread through the request holder, no need to notify the retry listener again in any other place (either catch oronFailure
method itself).