Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to early terminate document collection when searching/counting #6885

Merged
merged 1 commit into from Jul 23, 2014

Conversation

areek
Copy link
Contributor

@areek areek commented Jul 15, 2014

The idea is to add an option which will let the user control the number of matched documents collected per shard before scoring (if applicable). This will be helpful for exists-type functionality or when not all matched document has to be scored.

Closes #6876

@areek
Copy link
Contributor Author

areek commented Jul 15, 2014

TODO:

  • replace all usage of Lucene.ExistsCollector with the new Lucene.EarlyTerminatingCountCollector to take advantage of early termination.
  • use threshold for collation in PhraseSuggester

@@ -240,6 +259,10 @@ public void writeTo(StreamOutput out) throws IOException {
out.writeOptionalString(preference);
out.writeBytesReference(source);
out.writeStringArray(types);

if (out.getVersion().onOrAfter(Version.V_1_4_0)) {
out.writeVInt(threshold);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you use vint you can't write negative values :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just go with threshold + 1 on the write side and threshold - 1 on the read side

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to just change the default to 0, it does not really make sense for the user to set it to 0. maybe even throw an exception when they want to set it.

@s1monw
Copy link
Contributor

s1monw commented Jul 15, 2014

left some comments - I like it...

@areek
Copy link
Contributor Author

areek commented Jul 16, 2014

Updated PR:

  • incorporated feedback
  • replaced Lucene.ExistsCollector usage with the new EarlyTerminatingCountCollector (this might speed things up a little bit in the Percolator service)

TODO:

  • use threshold param in PhraseSuggester collation.

It seems like there is no equivalent of MultiSearchRequest and family for CountRequest. To use threshold for PhraseSuggester collation, it would be ideal to use CountRequest (for setting up the threshold param) and also use something similar to MultiSearchRequest for CountRequest (to dispatch requests for all the generated phrases).

Another way would be to somehow add the count-specific param threshold to the SearchRequest. Thoughts?

}

@Override
public void setScorer(Scorer scorer) throws IOException {
this.exists = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on removing this

@areek areek self-assigned this Jul 16, 2014
@areek
Copy link
Contributor Author

areek commented Jul 18, 2014

I have added a new terminate_after_count parameter to search and count API, which represents the number of documents to collect per shard. Upon reaching that number, the query will early terminate.

The search request with the new param looks as follows:

curl -XGET 'localhost:9200/_search' -d '
  { 
    terminate_after_count: 1, 
    "query" : ...
  }'

or

curl -XGET 'localhost:9200/_search?terminate_after_count=1' -d '
  { 
    "query" : ...
  }'

and for Count API, the request is as follows:

curl -XGET 'localhost:9200/_count?terminate_after_count=1' -d '
  { 
    "query" : ...
  }'

Both the Search & Count Response will have an added field terminated_after_count which will be true if any of the shards actually early terminated.

@areek areek changed the title Implement early termination for Count API, if threshold is provided Search & Count: Add early termination of document collection option Jul 18, 2014
@areek
Copy link
Contributor Author

areek commented Jul 18, 2014

@clintongormley any thoughts on the naming/api for this?

@areek
Copy link
Contributor Author

areek commented Jul 18, 2014

Updated PR:

  • add support for early termination to the Search API
  • use early termination in collation for PhraseSuggester
  • Added tests
  • Added terminated_after_count to Search and Count response, to indicate if any of the shards actually early terminated query exec. or not

TODO:

  • add docs for new option in Search API. (not exactly sure where would the best place to add it though)
  • add more tests for new option in Search API

@areek areek added the review label Jul 18, 2014
@areek areek changed the title Search & Count: Add early termination of document collection option Search & Count: Add an option to early terminate document collection Jul 18, 2014
@clintongormley
Copy link

@areek perhaps terminate_after, just to keep it short and simple. Wondering if the response should be true/false, or if it would be better to return terminated_after: 1000 (and leave out the key if it wasn't terminated early)

@areek
Copy link
Contributor Author

areek commented Jul 18, 2014

@clintongormley the problem with terminate_after is that its a little ambiguous, users might think it is a timeout or what not, rather than a threshold to doc collection. Though terminate_after is much nicer then terminate_after_count.
Regarding the response, I don't see the value in specifying the terminated_after count, as if the early termination did happen then this will always be the same as the input param. Specifying it also might confuse users as this is a threshold/shard, it might report num_shards*terminate_after_count hits/counts while having terminated_after: terminate_after_count in the resp.
I think terminate_after is a better name but a little ambiguous (might just roll with it) and for the response I will leave terminated_after out if it was false.
Thoughts?

@clintongormley
Copy link

@areek i agree with you about the terminate_after count. True/false does make more sense. But if the user does specify terminate_after, I think the flag should be in the response. Perhaps terminated? or terminated_early? Neither of those are perfect, but terminated_after feels like it is missing something...

@areek
Copy link
Contributor Author

areek commented Jul 22, 2014

Updated PR:

  • The new request param has been renamed to terminate_after from terminate_after_count.
  • Added documentation
  • The response for both search & count api looks like the following:
  {
    "took": 1,
    "timed_out": false,
    "terminated_early": true,
    "_shards": {
       ...
    },
    "hits": {
       ...
    }
}

The terminated_early will only be visible when the user explicitly sets terminate_after param in the request

@areek
Copy link
Contributor Author

areek commented Jul 22, 2014

@s1monw It would be awesome to have this reviewed.

return this;
}

public int terminateAfterCount() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually call the getter also terminateAfter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops refactoring miss. will fix

@s1monw
Copy link
Contributor

s1monw commented Jul 22, 2014

I left a couple of comments but it looks great so far

@s1monw s1monw removed the review label Jul 22, 2014
@@ -63,6 +63,9 @@ query.

|default_operator |The default operator to be used, can be `AND` or
`OR`. Defaults to `OR`.

|terminate_after |The maximum count for each shard, upon reaching which the query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add coming[1.4.0] here I guess?

@@ -62,6 +62,10 @@ that point when expired. Defaults to no timeout.
`query_and_fetch`. Defaults to `query_then_fetch`. See
<<search-request-search-type,_Search Type_>> for
more details on the different types of search that can be performed.

|`terminate_after` |The maximum number of documents to collect for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add coming[1.4.0] here I guess?

@clintongormley
Copy link

@areek Docs look good, I'd just add a mention of the key in the response when the request has been terminated early

@s1monw
Copy link
Contributor

s1monw commented Jul 23, 2014

did another review and this LGTM too

@areek
Copy link
Contributor Author

areek commented Jul 23, 2014

Mentioned the added key terminated_early in response when terminate_after is set. I will merge this soon.

Allow users to control document collection termination, if a specified terminate_after number is
set. Upon setting the newly added parameter, the response will include a boolean terminated_early
flag, indicating if the document collection for any shard terminated early.

closes elastic#6876
@areek areek merged commit 5487c56 into elastic:master Jul 23, 2014
@areek areek deleted the enhancement/6876 branch July 23, 2014 19:21
@clintongormley clintongormley changed the title Search & Count: Add an option to early terminate document collection Search: Add an option to early terminate document collection when searching/counting Sep 10, 2014
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Dec 11, 2014
gmarz added a commit to elastic/elasticsearch-net that referenced this pull request Dec 11, 2014
Mpdreamz added a commit that referenced this pull request Dec 17, 2014
Mpdreamz added a commit that referenced this pull request Dec 17, 2014
@clintongormley clintongormley added the :Search/Search Search-related issues that do not fall into other categories label Jun 6, 2015
@clintongormley clintongormley changed the title Search: Add an option to early terminate document collection when searching/counting Add an option to early terminate document collection when searching/counting Jun 6, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature release highlight :Search/Search Search-related issues that do not fall into other categories v1.4.0.Beta1 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add option to Count API to terminate early if a specified count is reached
4 participants