Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search might not return on thread pool rejection #6032

Closed
wants to merge 1 commit into from

Conversation

kimchy
Copy link
Member

@kimchy kimchy commented May 4, 2014

When a thread pool rejects the execution on the local node, the search might not return.
This happens due to the fact that we move to the next shard only within the execution on the thread pool in the start method. If it fails to submit the task to the thread pool, it will go through the fail shard logic, but without "counting" the current shard itself. When this happens, the relevant shard will then execute more times than intended, causing the total opes counter to skew, and for example, if on another shard the search is successful, the total ops will be incremented beyond the expectedTotalOps, causing the check on == as the exit condition to never happen.
The fix here makes sure that the shard iterator properly progresses even in the case of rejections, and also includes improvement to when cleaning a context is sent in case of failures (which were exposed by the test).
Though the change fixes the problem, we should work on simplifying the code path considerably, the first suggestion as a followup is to remove the support for operation threading (also in broadcast), and move the local optimization execution to SearchService, this will simplify the code in different search action considerably, and will allow to remove the problematic #firstOrNull method on the shard iterator.
The second suggestion is to move the optimization of local execution to the TransportService, so all actions will not have to explicitly do the mentioned optimization.
fixes #4887

When a thread pool rejects the execution on the local node, the search might not return.
This happens due to the fact that we move to the next shard only *within* the execution on the thread pool in the start method. If it fails to submit the task to the thread pool, it will go through the fail shard logic, but without "counting" the current shard itself. When this happens, the relevant shard will then execute more times than intended, causing the total opes counter to skew, and for example, if on another shard the search is successful, the total ops will be incremented *beyond* the expectedTotalOps, causing the check on == as the exit condition to never happen.
The fix here makes sure that the shard iterator properly progresses even in the case of rejections, and also includes improvement to when cleaning a context is sent in case of failures (which were exposed by the test).
Though the change fixes the problem, we should work on simplifying the code path considerably, the first suggestion as a followup is to remove the support for operation threading (also in broadcast), and move the local optimization execution to SearchService, this will simplify the code in different search action considerably, and will allow to remove the problematic #firstOrNull method on the shard iterator.
The second suggestion is to move the optimization of local execution to the TransportService, so all actions will not have to explicitly do the mentioned optimization.
fixes elastic#4887
@kimchy kimchy added bug labels May 4, 2014
@clintongormley
Copy link

@kimchy Could #5997 be related to this?

@kimchy
Copy link
Member Author

kimchy commented May 4, 2014

@clintongormley doesn't look like it at first glance...

logger.trace("failed to release context", t1);
}
}
listener.onFailure(t);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe put the for loop and onFailure call into a try final block?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see a reason where this will require a try .., finally, since the calls to release the context are properly protected. I think its enough?

@martijnvg
Copy link
Member

Sneaky bug! Left one minor comment, but other than that LGTM.

@kimchy kimchy added the v1.1.2 label May 5, 2014
@kimchy kimchy closed this May 5, 2014
@kimchy kimchy deleted the search_not_returning_on_reject branch May 5, 2014 07:26
@megastef
Copy link

In which version is this fixed? I've got a 1.1.2 and a single node.js process with async queries is running into the issue. Could you pls. recommend me quickly an alternative (1.2 got again breaking changes ...)

@kimchy
Copy link
Member Author

kimchy commented Jun 19, 2014

this is fixed in 1.1.2, are you sure you are running into this problem and not something else maybe? Can you open a new issue so we can track it, with a recreation if you can which would help tremendously

@clintongormley clintongormley added the :Search/Search Search-related issues that do not fall into other categories label Jun 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories v1.1.2 v1.2.0 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MultiSearch hangs forever + EsRejectedExecutionException
4 participants