MultiSearch hangs forever + EsRejectedExecutionException #4887
Comments
Which version are you running?
Also, can you paste the full log of the failure? It is cut off.
Side note: the rejections are expected. We have a limit on the search thread pool (3x cores) with a limited queue size (1000), so if you overload it (3x cores + 1000), requests will start to get rejected. This is a good thing, since it makes sure the servers are not overloaded. The fact that it gets stuck, that's weird (the logs + version would help).
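For reference, in the 0.90/1.x era the limits described above were configurable through the thread pool module in `elasticsearch.yml`. A sketch, assuming the ES 1.x `threadpool` setting names and the defaults mentioned in the comment (the `size` value here assumes a hypothetical 4-core machine):

```yaml
# Fixed search thread pool: size defaults to 3x the number of cores,
# with a bounded queue; requests beyond (size + queue_size) are rejected
# with EsRejectedExecutionException rather than piling up.
threadpool:
    search:
        type: fixed
        size: 12          # e.g. 3 x 4 cores
        queue_size: 1000
```

Raising `queue_size` only delays the rejection; as noted later in the thread, sustained overload is a capacity problem, not a configuration one.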
full log:
I'm on version 1.0.0.RC1. I also started seeing this problem after switching my code from using a post filter to using a query, so the code:
was working fine.
Hi Shay, I'm running 2 bulk requests (each with 420 requests, but in the ES queue that translates into more than 1000 requests) and getting the same exception for the second bulk. That worked fine on version 0.90.2, but now that I've upgraded to 0.90.10 I'm getting the exception. Was there a change in the queue size or rejection behavior since 0.90.2? Note: the request doesn't hang. Thanks in advance,
I am facing the exact same issue. What is the solution or workaround?
@orenorg yes, the defaults were added post-0.90.2 to set the queue size and make sure the server doesn't get overloaded. @tvinod if you get rejection failures, then you need to add more capacity to your cluster if you expect it to handle such load; overloading it without protection will just cause it to fall over. @karol-gwaj sorry to get back to you late, but did you try upgrading to the latest version to see if it got solved?
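When extra capacity is not an option, a common client-side pattern for handling rejections is to retry with exponential backoff. This is a generic sketch, not from the thread and not an Elasticsearch API; the predicate stands in for whatever call may be rejected:

```java
import java.util.function.IntPredicate;

public class RejectionRetryDemo {
    /**
     * Runs an action that may be rejected by a full server queue, backing
     * off exponentially between attempts. Returns the attempt number that
     * succeeded, or -1 if all attempts were rejected.
     */
    static int runWithBackoff(IntPredicate attemptSucceeds, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptSucceeds.test(attempt)) {
                return attempt;                 // server accepted the request
            }
            try {
                Thread.sleep(50L << (attempt - 1)); // 50ms, 100ms, 200ms, ...
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1; // still rejected: the cluster genuinely needs more capacity
    }

    public static void main(String[] args) {
        // Simulated queue: rejects the first two attempts, accepts the third.
        System.out.println(runWithBackoff(attempt -> attempt >= 3, 5)); // prints 3
    }
}
```

Backoff spreads the retries out so the client stops contributing to the overload it is reacting to; if every attempt is still rejected, the -1 path should surface as an error, never a silent hang.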
Thanks @kimchy. I understand the rejected-failures exception. My issue is that my Java call on the client side hangs forever; that shouldn't happen. It should either return an error or throw an exception. I'm on the latest 1.1.0 version.
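Until the fix landed, the usual client-side guard for this kind of hang was to never block on a future without a deadline. The sketch below illustrates the principle with a plain `java.util.concurrent.Future`; with the ES Java client the analogous move is passing a timeout to `actionGet` instead of calling it bare (the demo class and method names here are illustrative, not from the thread):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutGuardDemo {
    /** Waits on a future with a deadline instead of blocking forever. */
    static String getWithTimeout(Future<String> future, long millis) {
        try {
            return future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);      // give up on the hung request
            return "timed out";
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> hung = pool.submit(() -> {
            Thread.sleep(60_000);     // simulates a search that never returns
            return "response";
        });
        System.out.println(getWithTimeout(hung, 200)); // prints "timed out"
        pool.shutdownNow();
    }
}
```

A timeout converts a silent hang into a visible failure the caller can log and retry; it does not fix the underlying server-side bug discussed below in the thread.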
@tvinod if it hangs then it's a problem. Is there a chance you can write a repro for this? We will try to repro it as well...
I have a 100% repro, but I think it's because of my setup and the amount of data I … let me know.
@tvinod it's tricky with instrumentation, because of the async nature of ES... I wrote a very simple program that continuously simulates rejections and it doesn't seem to get stuck, so a repro (you can mail me privately) would go a long way toward solving this.
OK, I'll let you know when I have something for you. But if it helps: in my case, it hangs when the number of requests in the … thanks.
I believe I managed to recreate it; tricky... Hold off on the repro for now, I will continue to work on it.
I managed to recreate it under certain conditions (very tricky). This is similar in nature to #4526, and it happens because the rejection exception is raised on the calling thread, not on a forked thread. I will think about how this can be solved; we should fix this case cleanly in ES, but that's potentially a biggish refactoring. Will update this issue...
When a thread pool rejects the execution on the local node, the search might not return. This happens because we move to the next shard only *within* the execution on the thread pool, in the start method. If submitting the task to the thread pool fails, it goes through the fail-shard logic, but without "counting" the current shard itself. When this happens, the relevant shard executes more times than intended, causing the total ops counter to skew; for example, if the search succeeds on another shard, total ops is incremented *beyond* expectedTotalOps, so the == check used as the exit condition never fires.

The fix here makes sure that the shard iterator properly progresses even in the case of rejections, and also improves when a clean-context request is sent in case of failures (which was exposed by the test).

Though the change fixes the problem, we should work on simplifying the code path considerably. The first suggested followup is to remove the support for operation threading (also in broadcast) and move the local-execution optimization to SearchService; this will simplify the code in the different search actions considerably, and will allow removing the problematic #firstOrNull method on the shard iterator. The second suggestion is to move the local-execution optimization to the TransportService, so individual actions will not have to perform that optimization explicitly.

fixes elastic#4887
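The accounting failure the commit message describes can be boiled down to a counter that overshoots its target. This is a hypothetical simulation, with illustrative names rather than the actual Elasticsearch internals: when a rejected shard is retried without consuming its slot on the iterator, it is counted twice, so an exact-equality completion check can never trigger.

```java
public class TotalOpsBugDemo {
    /**
     * Simulates counting per-shard operations for one search. With the bug,
     * a rejected shard goes through the failure path *and* is executed again,
     * contributing two increments instead of one.
     */
    static int countOps(int shards, boolean doubleCountRejected) {
        int totalOps = 0;
        for (int shard = 0; shard < shards; shard++) {
            boolean rejected = (shard == 0); // first shard hits a full queue
            if (rejected && doubleCountRejected) {
                totalOps++; // failure path counts the op ...
                totalOps++; // ... and the re-execution counts it again
            } else {
                totalOps++; // normal path: exactly one op per shard
            }
        }
        return totalOps;
    }

    public static void main(String[] args) {
        int expectedTotalOps = 5;
        // Buggy path overshoots, so "totalOps == expectedTotalOps" is never
        // true and the search never completes; the fixed path lands exactly.
        System.out.println(countOps(5, true) == expectedTotalOps);  // prints false
        System.out.println(countOps(5, false) == expectedTotalOps); // prints true
    }
}
```

This is why the symptom was a hang rather than an error: the response is only sent when the counter reaches expectedTotalOps exactly, and the skewed counter steps over it.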
I finally found the problem (see pull request above). In the end it didn't relate to #4526, but to wrong management of how we iterate over the shard iterator between copies of the same shard.
Great, any ETA on the fix? What's the best way to get it?
I'm stuck on this now as well. I'm running a 1.1.1 ES server and the 1.0.2 Java client. It just hangs in the client on actionGet(). Is this a server or client bug? It was working, but the problem started when I increased the amount I'm bulking. Any ETA?
I'm getting this too; any solution? I'm running ES 1.4.0 and Java 1.7.
@situ2011 this was fixed in 1.1.2. If you're seeing something similar please open a new issue with all necessary details. |
I'm getting this error logged after executing many multisearch requests concurrently (snippet below for a single multisearch):
Exception call stack:
The worst thing about this issue is that it hangs 'forever' on the MultiSearchRequestBuilder .get() method (call stack of the hanging thread below):
context: