Transport client infinite retry #5675

Closed

magnhaug opened this issue Apr 3, 2014 · 1 comment
magnhaug commented Apr 3, 2014

Firstly, the onFailure method in TransportClientNodesService.RetryListener has a weak if-check:
if (i == nodes.size()) { /*...*/ }
If the variable i is allowed to progress past nodes.size(), we enter an endlessly recursive loop. This is reproducible in unit tests by throwing ConnectTransportException from every node, since the code can re-enter onFailure after "terminating".
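
To make the failure mode concrete, here is a minimal sketch (my own simplification, not the actual Elasticsearch source) of a retry listener whose termination check uses ==; once the counter overshoots the node count, the check can never fire again and every subsequent failure just retries:

import java.util.List;

// Simplified stand-in for TransportClientNodesService.RetryListener; the
// class and method bodies here are illustrative only.
class RetryListenerSketch {
    private final List<String> nodes;
    private volatile int i = 0; // index of the node currently being tried

    RetryListenerSketch(List<String> nodes) {
        this.nodes = nodes;
    }

    public void onFailure(Throwable t) {
        int next = ++i; // not atomic on a volatile, see note further down
        if (next == nodes.size()) { // weak: only true when next lands exactly on size()
            // give up and report the failure to the caller
            return;
        }
        // If a re-entrant call (or a second thread) pushes i past nodes.size(),
        // the equality above can never become true again, so every failure
        // leads to yet another retry. A `next >= nodes.size()` check would at
        // least terminate once the index has passed the end of the node list.
        retryOn(nodes.get(next % nodes.size()), t);
    }

    private void retryOn(String node, Throwable cause) {
        // re-sends the request; on ConnectTransportException the transport
        // layer calls onFailure(...) again, which is where the recursion in
        // the stack traces below comes from
    }
}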

Secondly, the reason we discovered this is that the Java client might (and, quite consistently, will) end up spawning an unbounded number of threads that loop on the onFailure method.
In our app this happens most frequently when we hit a major GC on the filter cache while sustaining about 40 threads of heavy read and write operations. We've also seen it happen when one of our two test nodes is shutting down.

I've attached a sample from a stack trace below. This is pulled from a live system, right as it suffered this error and started spawning threads spinning on RetryListener.
Pay attention to three things:

  1. This is a stack trace from the client, taken while the server is unresponsive due to GC.
  2. The thread numbers. This is just a tiny sample; there are thousands of these threads (as many as the client can allocate before hitting ulimit). We are doing a max of 40 simultaneous synchronous searches/writes, and would not expect any more simultaneous retries.
  3. The number of recursive calls to onFailure in the second stack trace. This is from a cluster with 2 nodes and 2 client apps; it should not recurse 10 times.

As I see it, the problem is two-fold:

  • Why does the client fork infinite threads upon a ConnectTransportException?
  • Why does each loop potentially recurse deeper than the number of nodes?

Also, as a side note, the prefix increment of the volatile i variable is not thread-safe; you would probably want an AtomicInteger instead (a rough sketch of what I mean follows below)? Although I do not understand why several threads would want to re-use the same RetryListener instance.
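
For comparison, here is a hypothetical version of the same bookkeeping using an AtomicInteger and a >= bound. Again, this is just a sketch under the assumption that the counter really is shared between threads, not a patch against the real code:

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical thread-safe variant; names and structure are illustrative only.
class AtomicRetryListenerSketch {
    private final List<String> nodes;
    private final AtomicInteger attempts = new AtomicInteger(0);

    AtomicRetryListenerSketch(List<String> nodes) {
        this.nodes = nodes;
    }

    public void onFailure(Throwable t) {
        // incrementAndGet is atomic, so two concurrent callers cannot both read
        // the same old value and each conclude that another retry is allowed
        int attempt = attempts.incrementAndGet();
        if (attempt >= nodes.size()) { // >= so an overshoot still terminates
            // all nodes exhausted: propagate the failure instead of retrying
            return;
        }
        retryOn(nodes.get(attempt), t);
    }

    private void retryOn(String node, Throwable cause) {
        // re-send the request to the given node (omitted in this sketch)
    }
}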

This is on ElasticSearch 0.90.7, on Java 7u25 or 7u45

"elasticsearch[Matt Murdock][generic][T#2023]" daemon prio=10 tid=0x00007f1fe5551800 nid=0x62fe waiting for monitor entry [0x00007f1f1d198000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:236)
        at java.lang.Thread.init(Thread.java:415)
        at java.lang.Thread.init(Thread.java:349)
        at java.lang.Thread.<init>(Thread.java:674)
        at org.elasticsearch.common.util.concurrent.EsExecutors$EsThreadFactory.newThread(EsExecutors.java:102)
        at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:610)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:924)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:203)
        at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:68)
        at org.elasticsearch.client.transport.support.InternalTransportClient$2.doWithNode(InternalTransportClient.java:109)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:259)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:89)
        at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

   Locked ownable synchronizers:
        - <0x0000000791ef8570> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[Matt Murdock][generic][T#2506]" daemon prio=10 tid=0x00007f1fe554f800 nid=0x62fd runnable [0x00007f1f1d299000]
   java.lang.Thread.State: RUNNABLE
        at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:68)
        at org.elasticsearch.client.transport.support.InternalTransportClient$2.doWithNode(InternalTransportClient.java:109)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:259)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
        at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:89)
        at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

   Locked ownable synchronizers:
        - <0x0000000793b97d08> (a java.util.concurrent.ThreadPoolExecutor$Worker)


etc, etc, etc, etc..
bleskes (Contributor) commented Apr 7, 2014

This has been fixed as part of the work for #4162, which was included in 0.90.8.

I'm closing this for now. If you are still seeing problems after upgrading, please feel free to reopen.

bleskes closed this as completed Apr 7, 2014