Search might not return on thread pool rejection #6032

kimchy · 2014-05-04T01:09:20Z

When a thread pool rejects the execution on the local node, the search might not return.
This happens because we move to the next shard only within the execution on the thread pool in the start method. If submitting the task to the thread pool fails, it goes through the fail shard logic, but without "counting" the current shard itself. When this happens, the relevant shard then executes more times than intended, which skews the total ops counter. For example, if the search succeeds on another shard, the total ops will be incremented beyond expectedTotalOps, so the == check used as the exit condition never matches.
The fix here makes sure that the shard iterator properly progresses even in the case of rejections, and also improves when a context cleanup is sent in case of failures (a problem that was exposed by the test).
Though the change fixes the problem, we should work on simplifying the code path considerably. The first suggestion as a followup is to remove the support for operation threading (also in broadcast) and move the local optimization execution to SearchService. This will simplify the code in the different search actions considerably, and will allow us to remove the problematic #firstOrNull method on the shard iterator.
The second suggestion is to move the optimization of local execution to the TransportService, so that actions do not have to do the mentioned optimization explicitly.
fixes #4887
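
The counter skew described above can be reproduced in a small model. This is only a sketch under stated assumptions, not the Elasticsearch code: `CopyIterator`, `Outcome`, and `simulate` are hypothetical names, the run is sequential rather than threaded, and it assumes that a failure counts one op while a success counts the current copy plus the copies still remaining on the iterator.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TotalOpsSketch {

    /** Hypothetical stand-in for a shard iterator over the copies of one shard. */
    static final class CopyIterator {
        private final int size;
        private int index = 0;

        CopyIterator(int size) { this.size = size; }

        /** Returns the first copy without advancing the iterator. */
        Integer firstOrNull() { return size == 0 ? null : 0; }

        /** Returns the next copy and advances, or null when exhausted. */
        Integer nextOrNull() { return index < size ? index++ : null; }

        int remaining() { return size - index; }
    }

    /** Final value of the counter and whether the == exit condition ever matched. */
    static final class Outcome {
        final int totalOps;
        final boolean finished;

        Outcome(int totalOps, boolean finished) {
            this.totalOps = totalOps;
            this.finished = finished;
        }
    }

    static Outcome simulate(boolean advanceBeforeSubmit) {
        final int shards = 2;
        final int copiesPerShard = 2;
        final int expectedTotalOps = shards * copiesPerShard;
        AtomicInteger totalOps = new AtomicInteger();
        boolean finished = false;

        for (int shard = 0; shard < shards; shard++) {
            CopyIterator it = new CopyIterator(copiesPerShard);
            // Fixed variant advances up front; buggy variant only peeks here.
            Integer copy = advanceBeforeSubmit ? it.nextOrNull() : it.firstOrNull();
            boolean rejected = (shard == 0); // assumption: the pool rejects shard 0's first submit

            if (rejected) {
                // Failure path: count one op, then fall back to the next copy.
                if (totalOps.incrementAndGet() == expectedTotalOps) finished = true;
                copy = it.nextOrNull(); // buggy variant receives copy 0 a second time
            } else if (!advanceBeforeSubmit) {
                it.nextOrNull(); // buggy variant advances only inside the submitted task
            }

            if (copy != null) {
                // Success path: count this copy plus the copies that are no longer needed.
                if (totalOps.addAndGet(it.remaining() + 1) == expectedTotalOps) finished = true;
            }
        }
        return new Outcome(totalOps.get(), finished);
    }

    public static void main(String[] args) {
        Outcome buggy = simulate(false);
        Outcome fixed = simulate(true);
        System.out.println("buggy: totalOps=" + buggy.totalOps + " finished=" + buggy.finished);
        System.out.println("fixed: totalOps=" + fixed.totalOps + " finished=" + fixed.finished);
    }
}
```

In this model the buggy variant ends with totalOps at 5 against an expected 4, so the equality check is skipped over and never matches; advancing the iterator before the submit lands exactly on 4.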

clintongormley · 2014-05-04T08:24:59Z

@kimchy Could #5997 be related to this?

kimchy · 2014-05-04T16:43:14Z

@clintongormley doesn't look like it at first glance...

martijnvg · 2014-05-04T16:56:56Z

src/main/java/org/elasticsearch/action/search/type/TransportSearchTypeAction.java

+                    logger.trace("failed to release context", t1);
+                }
+            }
+            listener.onFailure(t);


Maybe put the for loop and the onFailure call into a try/finally block?

I didn't see a reason why this would require a try/finally, since the calls to release the context are already properly protected. I think that's enough?
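
The pattern under discussion can be sketched as follows. This is a minimal illustration with hypothetical names (`ContextReleaser`, `Listener`, `releaseAllThenFail`), not the code from TransportSearchTypeAction: each release call sits in its own try/catch, so a failing release cannot stop the loop or prevent the final onFailure call. The actual change logs the release failure at trace level; the sketch only counts it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReleaseThenFailSketch {

    /** Hypothetical callback notified about the original search failure. */
    interface Listener {
        void onFailure(Throwable t);
    }

    /** Hypothetical hook that frees one search context and may throw. */
    interface ContextReleaser {
        void release(long contextId) throws Exception;
    }

    /** Releases every context, then reports the original failure; returns the number of failed releases. */
    static int releaseAllThenFail(List<Long> contextIds, ContextReleaser releaser,
                                  Listener listener, Throwable cause) {
        int releaseFailures = 0;
        for (long id : contextIds) {
            try {
                releaser.release(id);
            } catch (Throwable t1) {
                // Swallowed on purpose so the loop continues (the real code logs here).
                releaseFailures++;
            }
        }
        // Always reached, because every release above is individually protected.
        listener.onFailure(cause);
        return releaseFailures;
    }

    public static void main(String[] args) {
        List<Throwable> seen = new ArrayList<>();
        Throwable cause = new RuntimeException("search rejected");
        int failures = releaseAllThenFail(
                Arrays.asList(1L, 2L, 3L),
                id -> { if (id == 2L) throw new IllegalStateException("release failed"); },
                seen::add,
                cause);
        System.out.println("release failures=" + failures + ", listener calls=" + seen.size());
    }
}
```

A try/finally would only add protection if something other than the guarded release calls could throw before onFailure runs.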

martijnvg · 2014-05-04T17:29:21Z

Sneaky bug! Left one minor comment, but other than that LGTM.

megastef · 2014-05-31T22:24:38Z

In which version is this fixed? I'm on 1.1.2, and a single Node.js process issuing async queries is running into the issue. Could you please recommend an alternative quickly (1.2 again has breaking changes ...)?

kimchy · 2014-06-19T11:46:26Z

This is fixed in 1.1.2. Are you sure you are running into this problem and not something else? Can you open a new issue so we can track it, ideally with a recreation, which would help tremendously?

kimchy added bug labels May 4, 2014

martijnvg reviewed May 4, 2014

kimchy added the v1.1.2 label May 5, 2014

kimchy closed this May 5, 2014

kimchy deleted the search_not_returning_on_reject branch May 5, 2014 07:26

clintongormley added the :Search/Search label Jun 7, 2015
