Firstly, the onFailure method in TransportClientNodesService.RetryListener has a weak if-check: if (i == nodes.size()) { /*...*/ }
If the variable i is allowed to progress past nodes.size(), we enter an endlessly recursive loop. This is reproducible in unit tests by throwing ConnectTransportException from every node: the code can re-enter onFailure after "terminating".
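To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual Elasticsearch source) simulating the retry loop with an equality guard versus a >= guard. Once the counter has overshot the node count, an == check can never fire again:

```java
// Hypothetical simulation of the RetryListener termination check.
// retriesUntilStop and cap are illustration-only names, not the
// actual Elasticsearch source.
public class RetryGuardSketch {

    /**
     * Simulates repeated onFailure calls. Each call increments the
     * counter and checks the termination guard. Returns the number
     * of calls before termination, or -1 if the guard never fires
     * within cap iterations (i.e. the loop would recurse forever).
     */
    static int retriesUntilStop(int start, int size, boolean useGte, int cap) {
        int i = start;
        for (int calls = 1; calls <= cap; calls++) {
            i++; // ++i in the real code; a race can make it overshoot
            boolean done = useGte ? (i >= size) : (i == size);
            if (done) {
                return calls;
            }
        }
        return -1; // never terminated
    }

    public static void main(String[] args) {
        int size = 2; // nodes.size()
        // Normal case: starting at 0, the == guard terminates after 2 calls.
        System.out.println(retriesUntilStop(0, size, false, 1000)); // 2
        // Overshoot case: the counter has already raced past size.
        System.out.println(retriesUntilStop(size, size, false, 1000)); // -1, endless
        System.out.println(retriesUntilStop(size, size, true, 1000));  // 1, terminates
    }
}
```

The overshoot case is exactly what the recursive stack traces below look like: once i passes nodes.size(), every subsequent onFailure retries again.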
Secondly, the reason we discovered this is that the Java client can (and does, quite consistently) end up forking an unbounded number of threads that loop on the onFailure method.
In our app this happens most frequently when we hit a major GC on the filter cache while simultaneously maintaining about 40 threads of heavy read and write operations. We've also seen it happen when one of our two test nodes is shutting down.
I've attached a sample from a stack trace below. This is pulled from a live system, right as it suffered this error and started spawning threads spinning on RetryListener.
Pay attention to three things:
1. This is a stack trace from the client, taken while the server was unresponsive due to GC.
2. The thread numbers. This is just a tiny sample; there are thousands of these threads (as many as the client can allocate before hitting ulimit). We run at most 40 simultaneous synchronous searches/writes and would not expect any more simultaneous retries.
3. The number of recursive calls to onFailure in the second stack trace. This is from a cluster with 2 nodes and 2 client apps. It should not recurse 10 times.
As I see it, the problem is two-fold:
1. Why does the client fork unbounded threads upon a ConnectTransportException?
2. Why does each loop potentially recurse deeper than the number of nodes?
Also, as a side note, the prefix increment of the volatile i variable is not thread-safe; an AtomicInteger would probably be the right tool. Although I do not understand why several threads would want to re-use the same RetryListener instance.
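For illustration, a sketch of what a thread-safe counter could look like (hypothetical names, not a patch against the actual source; it assumes the guard is also changed to >= so an overshoot still terminates):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a hardened retry counter. AtomicInteger
// makes the increment a single atomic read-modify-write, and the
// >= guard terminates even if the counter overshoots nodes.size().
public class SafeRetryCounter {
    private final AtomicInteger i = new AtomicInteger();

    /** Returns true once retries are exhausted for nodeCount nodes. */
    public boolean failureExhausted(int nodeCount) {
        return i.incrementAndGet() >= nodeCount;
    }

    public static void main(String[] args) {
        SafeRetryCounter c = new SafeRetryCounter();
        System.out.println(c.failureExhausted(2)); // false (i = 1)
        System.out.println(c.failureExhausted(2)); // true  (i = 2)
        System.out.println(c.failureExhausted(2)); // true  (i = 3, still terminates)
    }
}
```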
This is on Elasticsearch 0.90.7, on Java 7u25 or 7u45.
"elasticsearch[Matt Murdock][generic][T#2023]" daemon prio=10 tid=0x00007f1fe5551800 nid=0x62fe waiting for monitor entry [0x00007f1f1d198000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:236)
at java.lang.Thread.init(Thread.java:415)
at java.lang.Thread.init(Thread.java:349)
at java.lang.Thread.<init>(Thread.java:674)
at org.elasticsearch.common.util.concurrent.EsExecutors$EsThreadFactory.newThread(EsExecutors.java:102)
at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:610)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:924)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:203)
at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:68)
at org.elasticsearch.client.transport.support.InternalTransportClient$2.doWithNode(InternalTransportClient.java:109)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:259)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:89)
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Locked ownable synchronizers:
- <0x0000000791ef8570> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[Matt Murdock][generic][T#2506]" daemon prio=10 tid=0x00007f1fe554f800 nid=0x62fd runnable [0x00007f1f1d299000]
java.lang.Thread.State: RUNNABLE
at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:68)
at org.elasticsearch.client.transport.support.InternalTransportClient$2.doWithNode(InternalTransportClient.java:109)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:259)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:89)
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Locked ownable synchronizers:
- <0x0000000793b97d08> (a java.util.concurrent.ThreadPoolExecutor$Worker)
etc, etc, etc, etc..