MultiSearch hangs forever + EsRejectedExecutionException #4887

Closed
karol-gwaj opened this issue Jan 25, 2014 · 19 comments

@karol-gwaj

I'm getting this error logged after executing many multi-search requests concurrently (snippet below for a single multi-search):

// Build one search per index and batch them into a single multi-search request.
MultiSearchRequestBuilder builder = _client.prepareMultiSearch();
for (final String index : indexes)
{
    final SearchRequestBuilder request = _client.prepareSearch(index).setPreference("_local");
    request.setQuery(QueryBuilders.termQuery("group_id", id)).setSize(100).setTimeout("10s");
    builder.add(request);
}

// Blocks until the multi-search completes; this is the call that never returns.
final MultiSearchResponse response = builder.get();

exception call stack:

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.acti
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.ja
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.ja
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.ja
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:190)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:59
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:49
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:108)
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
        at org.elasticsearch.action.search.TransportMultiSearchAction.doExecute(TransportMultiSearchAction.java:63)
        at org.elasticsearch.action.search.TransportMultiSearchAction.doExecute(TransportMultiSearchAction.java:39)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
        at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:92)
        at org.elasticsearch.client.support.AbstractClient.multiSearch(AbstractClient.java:242)
        at org.elasticsearch.action.search.MultiSearchRequestBuilder.doExecute(MultiSearchRequestBuilder.java:79)
        at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
        at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
        at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:67)
...

The worst thing about this issue is that it hangs 'forever' in the MultiSearchRequestBuilder.get() method (call stack of the hanging thread below):

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:274)
org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:113)
org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:45)
org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:67)
...

context:

  • es version 1.0.0.RC1
  • 3 node cluster
  • 20 indexes (around 2,000,000 documents per index)
  • 12 shards per index
  • 2 replicas
@kimchy (Member) commented Jan 25, 2014

Which version are you running?

@kimchy (Member) commented Jan 25, 2014

Also, can you paste the full log of the failure? It is cut off.

@kimchy (Member) commented Jan 25, 2014

Side note: the rejections are expected. We have a limit on the search thread pool (3x cores) with a limited queue size (1000), so if you overload it (more than 3x cores + 1000 concurrent shard searches), requests will start to get rejected. This is a good thing, since it makes sure the servers are not overloaded. The fact that it gets stuck, that's weird (and the logs + version would help).
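As a rough, illustrative sketch of client-side protection against those rejections (not from this thread; it assumes the same MultiSearchRequestBuilder usage as the snippet above, and ThrottledMultiSearch / MAX_IN_FLIGHT are made-up names), a plain java.util.concurrent.Semaphore can cap how many multi-search calls are in flight at once:

import java.util.concurrent.Semaphore;

import org.elasticsearch.action.search.MultiSearchRequestBuilder;
import org.elasticsearch.action.search.MultiSearchResponse;

public class ThrottledMultiSearch {

    // Hypothetical cap on concurrent multi-search calls from this client;
    // tune it against the cluster's search pool size (3x cores) and queue (1000).
    private static final int MAX_IN_FLIGHT = 8;

    private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);

    public MultiSearchResponse run(MultiSearchRequestBuilder builder) throws InterruptedException {
        permits.acquire();        // block the caller locally instead of flooding the cluster
        try {
            return builder.get(); // same blocking call as in the snippet above
        } finally {
            permits.release();
        }
    }
}

Blocking on the semaphore keeps the backpressure on the caller instead of pushing the server's search queue past its 3x cores + 1000 capacity.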

@karol-gwaj (Author)

full log:

[2014-01-25 02:07:39,613][DEBUG][action.search.type       ] [<node name>] [<index name>][2], node[WTCscW1_R7uA7juvJ1lacg], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@46d6608e] lastShard [true]
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$4@3bdb7cad
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:289)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:296)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:296)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:190)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:59)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:49)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:108)
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
        at org.elasticsearch.action.search.TransportMultiSearchAction.doExecute(TransportMultiSearchAction.java:63)
        at org.elasticsearch.action.search.TransportMultiSearchAction.doExecute(TransportMultiSearchAction.java:39)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
        at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:92)
        at org.elasticsearch.client.support.AbstractClient.multiSearch(AbstractClient.java:242)
        at org.elasticsearch.action.search.MultiSearchRequestBuilder.doExecute(MultiSearchRequestBuilder.java:79)
        at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
        at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
        at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:67)
        ... concealed ...
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

And I'm on version 1.0.0.RC1.

Also, I started seeing this problem after switching my code from using a post filter to using a query, so this code:

MultiSearchRequestBuilder builder = _client.prepareMultiSearch();
for (final String index: indexes)
{
    final SearchRequestBuilder request = _client.prepareSearch(index).setPreference("_local");
    request.setPostFilter(FilterBuilders.termFilter("group_id", id)).setSize(100).setTimeout("10s");
    builder.add(request);
}

was working fine

@orenorgad

Hi Shay, I'm running 2 bulk requests (each with 420 requests, but on the ES queue that translates into more than 1000 requests) and getting the same exception for the second bulk. That worked fine when running on version 0.90.2, but now that I've upgraded to 0.90.10 I'm getting the exception. Was there a change in the queue size or rejection behavior since 0.90.2?

note: the request doesn't hang.

Thanks in advance,

@tvinod commented May 1, 2014

I am facing the exact same issue. What is the solution or workaround?

@kimchy (Member) commented May 1, 2014

@orenorgad yes, the defaults were added post 0.90.2 to set the queue size to make sure the server doesn't get overloaded.

@tvinod if you get rejected failures, then you need to add more capacity to your cluster if you expect it to handle such load. Overloading it without protection will just cause it to fall over.

@karol-gwaj sorry for getting back to you late, but did you try upgrading to the latest version to see if it is solved?
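To make the "rejected failures" point concrete, here is a hedged sketch (assuming the 1.x Java client API; MultiSearchFailureCheck and logFailures are made-up names) of how rejected sub-searches can be surfaced once a multi-search response does come back, either as failed items or as shard-level failures inside an otherwise successful item:

import org.elasticsearch.action.search.MultiSearchResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.ShardSearchFailure;

public class MultiSearchFailureCheck {

    public void logFailures(MultiSearchResponse response) {
        for (MultiSearchResponse.Item item : response.getResponses()) {
            if (item.isFailure()) {
                // the whole sub-search failed, e.g. every shard copy was rejected
                System.err.println("sub-search failed: " + item.getFailureMessage());
                continue;
            }
            SearchResponse searchResponse = item.getResponse();
            for (ShardSearchFailure shardFailure : searchResponse.getShardFailures()) {
                // partial failure: some shards were rejected, the rest returned hits
                System.err.println("shard failed: " + shardFailure.reason());
            }
            // searchResponse.getHits() still holds results from the shards that succeeded
        }
    }
}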

@tvinod commented May 1, 2014

Thanks @kimchy. I understand the rejection exception. My issue is that my Java call on the client side hangs forever. That shouldn't happen; it should either return an error or throw an exception. I'm on the latest 1.1.0 version.
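What tvinod is asking for can at least be approximated on the client side. The sketch below is editorial (assuming the 1.x Java client API; NonBlockingMultiSearch and the 30-second timeout are made-up), and it does not work around the server-side bug fixed later in this thread, but a bounded wait or a listener keeps the calling thread from parking indefinitely:

import java.util.concurrent.TimeUnit;

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.search.MultiSearchRequestBuilder;
import org.elasticsearch.action.search.MultiSearchResponse;

public class NonBlockingMultiSearch {

    // Bounded wait: throws a timeout exception after 30 seconds instead of parking forever.
    public MultiSearchResponse searchWithTimeout(MultiSearchRequestBuilder builder) {
        return builder.execute().actionGet(30, TimeUnit.SECONDS);
    }

    // Fully asynchronous: failures (including rejections) arrive via onFailure,
    // and no caller thread sits blocked in get().
    public void searchAsync(MultiSearchRequestBuilder builder) {
        builder.execute(new ActionListener<MultiSearchResponse>() {
            @Override
            public void onResponse(MultiSearchResponse response) {
                // handle the per-index responses here
            }

            @Override
            public void onFailure(Throwable e) {
                // log, retry with backoff, or surface the error to the caller
            }
        });
    }
}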

@kimchy (Member) commented May 1, 2014

@tvinod if it hangs then it's a problem. Is there a chance that you can write a repro for this? We will try to repro it on our side as well...

@tvinod commented May 1, 2014

I have a 100% repro, but I think it's because of my setup, the amount of data I have in ES, and the client pattern. I can try my best to see if I can write a repro for you. But if there is any instrumentation that you want me to do, either on the client side or the ES config side, I can do it.

Let me know.

@kimchy (Member) commented May 1, 2014

@tvinod it's tricky with instrumentation because of the async nature of ES... I wrote a very simple program that continuously simulates rejections and it doesn't seem to get stuck, so a repro (you can mail me privately) would go a long way toward solving this.

@tvinod commented May 1, 2014

OK, I'll let you know when I have something for you.

But if it helps: in my case, it hangs when the number of requests in the multi-search is 26. It's not a magic number, it just happens to be that way in my case.

Thanks

@kimchy (Member) commented May 1, 2014

I believe I managed to recreate it, tricky... Hold off on the repro for now, I will continue to work on it.

@kimchy kimchy added bug labels May 1, 2014
@kimchy (Member) commented May 1, 2014

I managed to recreate it under certain conditions (very tricky). This is similar in nature to #4526, and it happens because the rejection exception is thrown on the calling thread rather than on a forked thread. I will think about how this can be solved; we should fix this case cleanly in ES, but that is potentially a biggish refactoring. Will update this issue...

kimchy added a commit to kimchy/elasticsearch that referenced this issue May 4, 2014
When a thread pool rejects the execution on the local node, the search might not return.
This happens due to the fact that we move to the next shard only *within* the execution on the thread pool in the start method. If it fails to submit the task to the thread pool, it will go through the fail shard logic, but without "counting" the current shard itself. When this happens, the relevant shard will then execute more times than intended, causing the total ops counter to skew; for example, if on another shard the search is successful, the total ops will be incremented *beyond* the expectedTotalOps, causing the check on == as the exit condition to never fire.
The fix here makes sure that the shard iterator properly progresses even in the case of rejections, and also includes improvement to when cleaning a context is sent in case of failures (which were exposed by the test).
Though the change fixes the problem, we should work on simplifying the code path considerably, the first suggestion as a followup is to remove the support for operation threading (also in broadcast), and move the local optimization execution to SearchService, this will simplify the code in different search action considerably, and will allow to remove the problematic #firstOrNull method on the shard iterator.
The second suggestion is to move the optimization of local execution to the TransportService, so all actions will not have to explicitly do the mentioned optimization.
fixes elastic#4887
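To make the counting failure described above concrete, here is a toy sketch (not Elasticsearch code; TotalOpsSketch and its method names are made-up) of how an uncounted rejection lets the total ops counter overshoot expectedTotalOps, so the == exit check never fires:

import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicInteger;

class TotalOpsSketch {

    final int expectedTotalOps = 3;                 // e.g. three shard copies to query
    final AtomicInteger totalOps = new AtomicInteger();

    void onShardResult() {
        // Each shard copy is meant to count exactly once; the search finishes
        // only when the counter lands exactly on expectedTotalOps.
        if (totalOps.incrementAndGet() == expectedTotalOps) {
            finish();
        }
    }

    void onShardRejected(RejectedExecutionException e) {
        // Pre-fix behaviour per the commit message: the rejected attempt was not
        // counted here, the same shard got executed again, and the extra attempts
        // pushed totalOps past expectedTotalOps, so the == check in onShardResult
        // never matched and finish() never ran.
    }

    void finish() {
        // release the future the client is blocked on
    }
}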
@kimchy (Member) commented May 4, 2014

I finally found the problem (see the pull request above). In the end it did not relate to #4526, but to wrong management of how we iterate over the shard iterator between copies of the same shard.

@tvinod commented May 5, 2014

Great, any ETA on the fix? What's the best way to get it?
Thanks

@kimchy kimchy closed this as completed in 342a32f May 5, 2014
kimchy added a commit that referenced this issue May 5, 2014
When a thread pool rejects the execution on the local node, the search might not return.
kimchy added a commit that referenced this issue May 5, 2014
When a thread pool rejects the execution on the local node, the search might not return.
@kimchy kimchy added the v1.1.2 label May 5, 2014
@billynewport

I'm stuck on this now also. I'm running a 1.1.1 ES server and a 1.0.2 Java client. It just hangs in the client on actionGet(). Is this a server or client bug? It was working, but when I increased the amount I'm bulking, the problem started. Any ETA?

mateuszkaczynski added a commit to arachnys/elasticsearch that referenced this issue May 16, 2014
@situ2011

I got this too, any solution? I'm running ES 1.4.0 and Java 1.7.

@clintongormley

@situ2011 this was fixed in 1.1.2. If you're seeing something similar please open a new issue with all necessary details.

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
When a thread pool rejects the execution on the local node, the search might not return.