Add from support to top_hits aggregator. #6299

Closed
martijnvg opened this issue May 23, 2014 · 26 comments

Comments

@martijnvg
Member

No description provided.

@jpountz
Contributor

jpountz commented May 23, 2014

+1

@Kumen

Kumen commented May 26, 2014

pagination for top_hits aggregator, definitely +1

@artemredkin

Judging by commit a989229, it seems that you added pagination for hits, but what about the groups (buckets) themselves? For example, if I group books by author and the search query returns 9 unique authors, can I show a first page with 5 authors and a next page with the other 4 authors?
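
A minimal sketch of what the hit-level paging mentioned here could look like, assuming top_hits accepts `from` and `size` the same way a regular search does (which is what this issue asks for). The aggregation names `by_group` and `page_of_hits` and the field name `author` are hypothetical. Note that this pages the hits inside each bucket, not the buckets themselves.

```python
def top_hits_page_request(group_field, hits_from, hits_size):
    """Build a search body that pages the hits inside each terms bucket."""
    return {
        "size": 0,  # suppress top-level hits; only aggregation output is wanted
        "aggs": {
            "by_group": {  # hypothetical aggregation name
                "terms": {"field": group_field},
                "aggs": {
                    "page_of_hits": {  # hypothetical aggregation name
                        "top_hits": {"from": hits_from, "size": hits_size}
                    }
                },
            }
        },
    }


# Second page of three hits per author (hits 4 to 6 of every bucket).
body = top_hits_page_request("author", hits_from=3, hits_size=3)
```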

@Kumen

Kumen commented May 27, 2014

Already tested; your assumption is correct. We need pagination support for buckets.

@jpountz
Contributor

jpountz commented May 27, 2014

Paging is tricky. We might be able to expose it when sorting by term (would it work for you?), but if you are sorting by counts or by sub aggregation, then #1305 would make counts wrong and ordering inconsistent across pages.

@Kumen

Kumen commented May 27, 2014

Yes, that would work for me. Please expose this feature.

@artemredkin

I need at least sort by term and sort by doc count. Is approximate count/paging possible?

@jpountz
Contributor

jpountz commented May 27, 2014

I'm a bit reluctant to add paging support when sorting by counts, given that it would give more accurate results on the 1st result of the 2nd page than on the last one of the 1st page. Please also note that the way it would work behind the scenes would not be different from what you could do on the client side by first requesting 10 buckets for the 1st page, then requesting 20 for the 2nd page and disregarding the first 10 buckets, and so on for subsequent pages.
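
The client-side approach described above can be sketched roughly as follows. This is only an illustration: the helper names, the aggregation name `groups`, and the stand-in bucket list are hypothetical, and no request is actually sent.

```python
def buckets_request(field, page, page_size):
    """Ask for enough buckets to cover every page up to and including `page`."""
    return {
        "size": 0,
        "aggs": {
            "groups": {  # hypothetical aggregation name
                "terms": {"field": field, "size": page * page_size}
            }
        },
    }


def slice_page(buckets, page, page_size):
    """Disregard the buckets of earlier pages and keep only the requested one."""
    start = (page - 1) * page_size
    return buckets[start:start + page_size]


# Stand-in for the buckets of a real response to buckets_request("author", 2, 10).
fake_buckets = [{"key": f"term{i}", "doc_count": 100 - i} for i in range(1, 21)]
second_page = slice_page(fake_buckets, page=2, page_size=10)
```

The cost grows with the page number, since page n always requests n times the page size in buckets.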

@artemredkin

Yes, it can be implemented on the client (in my system it already is, but it's not very efficient). If there is only one way to sort/page buckets, then clients can either decide at runtime which implementation to use (which requires a lot of code and conditions) or implement grouping themselves altogether. Maybe I can help with some experiments to see if it can be done at all?

@jpountz
Contributor

jpountz commented May 27, 2014

I think this feature would be easy to implement; what I'm more concerned about here is exposing a feature that would be error-prone. :(

@Kumen

Kumen commented May 27, 2014

In our system it is already implemented, but we have memory issues when requesting all possible buckets, so we need a way to navigate through the buckets efficiently on the server side. I thought it would help to get the buckets paged from the server.

Edit: the memory issues are on the Elasticsearch server, not on the client side.

@artemredkin

You already provide at least one feature that can be inaccurate: cardinality aggregations. Also, from the documentation about the terms aggregation:
The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client).
So you already have an error-prone feature (clients can set size to 0 on a high-cardinality field and shoot themselves in the foot).
Another consideration: even the simple solution you proposed (dropping n-1 pages on the Elasticsearch side) can be advantageous, since we can run Elasticsearch on more powerful machines than our backends.

@jpountz
Contributor

jpountz commented May 27, 2014

This feature already has accuracy issues indeed, but in my opinion paging will make it even worse. For example, let's imagine that your top terms are term1, term2, ..., term10. If your page size is 5, it could happen that Elasticsearch returns term1, term2, term3, term4 and term6 on the first page (term6 instead of term5 because of inaccuracy), and then term6, term7, term8, term9 and term10 on the second page (as expected). So you would have one term that would be completely invisible to your users (term5) and another one that would appear twice (term6). I think this is too confusing.
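
To make the scenario above concrete, here is a small sketch that checks a set of pages for repeated and missing terms. The helper `audit_pages` is hypothetical and the page contents are simply the ones from the example above, not real Elasticsearch output.

```python
def audit_pages(pages, expected_terms):
    """Return (terms shown more than once, expected terms never shown)."""
    shown = [term for page in pages for term in page]
    duplicated = sorted({term for term in shown if shown.count(term) > 1})
    missing = [term for term in expected_terms if term not in shown]
    return duplicated, missing


expected = [f"term{i}" for i in range(1, 11)]
page_1 = ["term1", "term2", "term3", "term4", "term6"]  # term6 slipped in for term5
page_2 = ["term6", "term7", "term8", "term9", "term10"]

duplicated, missing = audit_pages([page_1, page_2], expected)
```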

@artemredkin

Too bad; without sorting by count this feature is only half useful. Can this issue be re-evaluated as a separate task, connected to issue #256?

@artemredkin

And how about this: you provide a Java interface for sorters, and through a plugin we can add our own sorters for possibly inaccurate results?

@Kumen

Kumen commented May 28, 2014

My concrete problem: I have millions of documents that I want to aggregate by key, and Elasticsearch fails at this point. In the worst case there are over 200,000 buckets with about 10 matching documents per bucket. I thought an effective way to minimize memory consumption would be to page through the buckets on the server side. Is there any other solution besides adding more RAM?

@artemredkin

Also, have you seen SolrCloud's implementation of facets? Something like: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201209.mbox/%3Calpine.DEB.2.02.1209261450570.2316@frisbee%3E

@kimchy
Member

kimchy commented May 28, 2014

@artemredkin there is a difference between returning exact counts for specific terms, and guaranteeing total ordering. The first can be done with another round after picking the top N, the second can not.

@jpountz
Contributor

jpountz commented May 28, 2014

@Kumen memory usage currently depends mostly on your number of buckets, not the size of the page that is requested. If you are not running Elasticsearch 1.2 yet, I would recommend upgrading, as memory usage of the terms aggregation improved significantly in this release.

@Kumen

Kumen commented May 28, 2014

I am currently using version 1.3 (manual build from branch).
So I need more memory.

thanks

@artemredkin

@kimchy I may be horribly wrong here (will do more digging today), but Solr's field collapsing works in a distributed environment and provides paging/ordering (at least I do not see any indication in their documentation that it is not supported). Plus, @jpountz pointed to #1305 as a source of the ordering problem.
In your opinion, can this problem (ordering of groups by count) be solved at all (leveraging the cardinality agg, for example)? Maybe in later releases?

@kimchy
Member

kimchy commented May 28, 2014

@artemredkin guaranteed total ordering can't be solved (with a 2 way execution) unless all the values are streamed, so by definition, pagination will not be "exact", that is the problem. You can say that for the top N (or paginated N), the count for each term will be exact by executing another round, but not the total order of them.

It is an interesting problem, specifically with the fact that we would love to solve it in a somewhat performant manner. We obviously would love to solve it, if we manage to come up with a way to do it that is :)
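
The distinction between exact counts and guaranteed ordering can be illustrated with a toy simulation. This is not how Elasticsearch is implemented; it is a sketch that assumes each shard only reports its local top N, with made-up shard contents and hypothetical function names.

```python
from collections import Counter


def first_round(shards, n):
    """Merge each shard's local top-n counts and pick the global top-n candidates."""
    merged = Counter()
    for shard in shards:
        merged.update(dict(Counter(shard).most_common(n)))
    return [term for term, _ in merged.most_common(n)]


def second_round(shards, terms):
    """Ask every shard for the exact count of the chosen terms only."""
    return {term: sum(shard.get(term, 0) for shard in shards) for term in terms}


shards = [
    {"a": 10, "b": 9, "c": 8},
    {"d": 11, "c": 9, "a": 2},
]

candidates = first_round(shards, n=2)
exact_counts = second_round(shards, candidates)

# Ground truth, which requires seeing every value from every shard.
totals = Counter()
for shard in shards:
    totals.update(shard)
true_top = [term for term, _ in totals.most_common(2)]
```

Here the second round returns exact counts for the chosen candidates, yet the term `c` (17 documents in total) never becomes a candidate because it is not in either shard's local top 2. The counts are exact, the ordering is not.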

@artemredkin

@kimchy Aren't top N groups exactly what we need for pagination? I was going to implement it on the client in 2 hops (it would be 3-way execution, yes?). On the first hop, get (1.5 * size of group page) worth of terms, maybe with cardinality, and sort them; on the second hop, get those terms with top_hits. For the second page, get (3 * size of group page) worth of terms, and so on. Is it wrong :) ? Or just too slow for inclusion inside Elasticsearch itself?
Anyway, thanks for explaining things; it would be awesome if you solve this :).
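
The two-hop idea above might be sketched like this. The over-fetch factor of 1.5 comes from the comment; the helper names, the aggregation names, and the use of a `terms` query to restrict the second hop are assumptions made for illustration, and a real implementation would also have to carry the original search query along.

```python
import math


def candidates_request(field, page, page_size, overfetch=1.5):
    """Hop 1: over-fetch group keys (1.5 * page size for page 1, 3 * for page 2, ...)."""
    size = math.ceil(overfetch * page * page_size)
    return {
        "size": 0,
        "aggs": {"candidates": {"terms": {"field": field, "size": size}}},
    }


def group_hits_request(field, group_keys, hits_per_group):
    """Hop 2: fetch top hits only for the group keys chosen for the page."""
    return {
        "size": 0,
        "query": {"terms": {field: group_keys}},  # assumed way to restrict groups
        "aggs": {
            "groups": {
                "terms": {"field": field, "size": len(group_keys)},
                "aggs": {"hits": {"top_hits": {"size": hits_per_group}}},
            }
        },
    }


hop_1_page_1 = candidates_request("author", page=1, page_size=10)
hop_1_page_2 = candidates_request("author", page=2, page_size=10)
hop_2 = group_hits_request("author", ["tolkien", "lem"], hits_per_group=3)
```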

martijnvg added a commit that referenced this issue May 30, 2014
@martijnvg
Member Author

I didn't mean to stop this discussion by closing this issue...

Adding pagination in the terms aggregation is tricky, as @kimchy and @jpountz describe, and the correctness would depend on the ordering. The correctness of the ordering depends on which order the terms aggregation is set to. If the terms aggregation's order is set to _term or to specific inner metric aggregations (min or max metric bucket, but not avg metric bucket), the ordering is correct.

The 'result grouping' approach in ES relies on the terms aggregation to determine the correct groups and an inner max aggregation for ordering of the groups:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-top-hits-aggregation.html#_field_collapse_example

Pagination can be simulated by using the terms aggregation's exclude option. On subsequent search requests, the previously emitted term buckets should be added to the exclude option; this way previously seen groups don't end up in the next aggregation response.
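
A rough sketch of that exclude-based approach, under two assumptions that should be checked against the Elasticsearch version in use: that `exclude` accepts a list of exact terms (older releases may only accept a regular expression), and that groups are ordered by a `max` sub-aggregation on a field. All names (`groups`, `top_sort`, `hits`, the field names) and the response stub are hypothetical.

```python
def next_page_request(group_field, sort_field, page_size, seen_terms):
    """Build the request for the next page of groups, skipping groups already shown."""
    terms = {
        "field": group_field,
        "size": page_size,
        "order": {"top_sort": "desc"},  # order buckets by the max sub-aggregation
    }
    if seen_terms:
        # Assumption: exclude takes a list of exact term values.
        terms["exclude"] = sorted(seen_terms)
    return {
        "size": 0,
        "aggs": {
            "groups": {
                "terms": terms,
                "aggs": {
                    "top_sort": {"max": {"field": sort_field}},
                    "hits": {"top_hits": {"size": 1}},
                },
            }
        },
    }


def remember_seen(seen_terms, response):
    """Add the keys of the buckets in a response to the set of seen groups."""
    buckets = response["aggregations"]["groups"]["buckets"]
    return seen_terms | {bucket["key"] for bucket in buckets}


seen = set()
page_1_request = next_page_request("author", "published", 2, seen)
# Stand-in for the response to page_1_request.
page_1_response = {"aggregations": {"groups": {"buckets": [{"key": "lem"}, {"key": "tolkien"}]}}}
seen = remember_seen(seen, page_1_response)
page_2_request = next_page_request("author", "published", 2, seen)
```

The exclude list grows with every page, so this is better suited to shallow paging than to walking through a very large number of groups.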

@artemredkin

Hm,

"or to specific inner metric aggregations"

Does this mean that I can use a 'cardinality' sub-agg to sort term groups?

@martijnvg
Member Author

does this mean, that I can use 'cardinality' sub-agg to sort term groups?

Yes, you can sort by a cardinality inner metric aggregation, but the ordering of the buckets depends on the accuracy of the cardinality aggregation.

@clintongormley clintongormley changed the title Add from support to top_hits aggregator. Aggregations: Add from support to top_hits aggregator. Jul 16, 2014
@clintongormley clintongormley changed the title Aggregations: Add from support to top_hits aggregator. Add from support to top_hits aggregator. Jun 7, 2015
@colings86 colings86 added :Analytics/Aggregations Aggregations and removed :Analytics/Aggregations Aggregations labels Mar 31, 2017
dpblh pushed a commit to dpblh/elastic4s that referenced this issue Jun 5, 2018
was added pagination to top_hits aggregation
sksamuel pushed a commit to Philippus/elastic4s that referenced this issue Jun 13, 2018
was added pagination to top_hits aggregation