Add support for shard_size for terms & terms_stats facets #3821

Closed
uboness opened this issue Oct 2, 2013 · 0 comments
uboness commented Oct 2, 2013

shard_size will make it possible to increase the accuracy of the returned term entries.

The size parameter defines how many top terms should be returned out
of the overall terms list. By default, the node coordinating the
search process will ask each shard to provide its own top size terms
and, once all shards respond, it will reduce the results to the final list
that is then sent back to the client. This means that if the number
of unique terms is greater than size, the returned list is slightly off
and not accurate (the term counts may be slightly off, and a term that
should have been in the top size entries may not be returned at all).

The higher the requested size is, the more accurate the results will be,
but also the more expensive it will be to compute the final results (both
because of the bigger priority queues managed on the shard level and because
of the bigger data transfers between the nodes and the client). To
minimize the extra work that comes with a bigger requested size, a
shard_size parameter is introduced. Once defined, it determines
how many terms the coordinating node requests from each shard. Once
all the shards respond, the coordinating node reduces them
to a final result based on the size parameter - this way,
one can increase the accuracy of the returned terms and avoid the overhead
of streaming a big list of terms back to the client.

Note that shard_size cannot be smaller than size... if that's the case
elasticsearch will override it and reset it to be equal to size.
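As an illustration, a request along these lines (the index and field names here are made up) would ask each shard for its top 100 terms while still returning only the top 10 terms to the client:

```
curl -XPOST 'localhost:9200/products/_search?search_type=count' -d '{
  "facets": {
    "popular_tags": {
      "terms": {
        "field": "tags",
        "size": 10,
        "shard_size": 100
      }
    }
  }
}'
```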

@uboness uboness closed this as completed in f3c6108 Oct 2, 2013
uboness added a commit that referenced this issue Oct 2, 2013
…he "shard_size" is the number of term entries each shard will send back to the coordinating node. "shard_size" > "size" will increase the accuracy (both in terms of the counts associated with each term and the terms that will actually be returned the user) - of course, the higher "shard_size" is, the more expensive the processing becomes as bigger queues are maintained on a shard level and larger lists are streamed back from the shards.

closes #3821
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
…he "shard_size" is the number of term entries each shard will send back to the coordinating node. "shard_size" > "size" will increase the accuracy (both in terms of the counts associated with each term and the terms that will actually be returned the user) - of course, the higher "shard_size" is, the more expensive the processing becomes as bigger queues are maintained on a shard level and larger lists are streamed back from the shards.

closes elastic#3821