I experimented some more, and this is only one of the perf problems. The other is the massive number of Bucket objects created (maxOrd of them).
We should do this more like a Lucene collector: use the PQ more efficiently so that we only create queue_size Bucket objects. With high-cardinality fields, this method is just as much of a hotspot as collecting ordinals, so we should optimize it as such: as a first step, at least avoid creating a Bucket object if it can't compete with the PQ, and maybe later try to optimize the loop with sentinels and so on.
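To make the "only create queue_size Bucket objects" idea concrete, here is a minimal sketch. It is not the Elasticsearch code: `BucketReuse`, its `Bucket` class, and the `allocated` counter are hypothetical names for illustration, and it uses `java.util.PriorityQueue` rather than Lucene's own queue. Once the queue is full, a non-competitive ordinal is skipped without any allocation, and a competitive one recycles the Bucket it evicts. Sentinels are not shown.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch: allocate at most queueSize Bucket objects, however many ordinals exist. */
final class BucketReuse {

    /** Mutable bucket so that an evicted instance can be recycled. */
    static final class Bucket {
        int ord;
        int count;
    }

    /** Number of Bucket objects created so far (only here to make the saving visible). */
    int allocated = 0;

    PriorityQueue<Bucket> collect(int[] countsByOrd, int queueSize) {
        // Min-heap on count: the head is always the weakest bucket currently kept.
        PriorityQueue<Bucket> pq = new PriorityQueue<>(
                Math.max(1, queueSize), (a, b) -> Integer.compare(a.count, b.count));
        if (queueSize <= 0) {
            return pq;
        }
        for (int ord = 0; ord < countsByOrd.length; ord++) {
            int count = countsByOrd[ord];
            if (pq.size() < queueSize) {
                // Queue not full yet: this is the only place a Bucket is allocated.
                Bucket b = new Bucket();
                allocated++;
                b.ord = ord;
                b.count = count;
                pq.add(b);
            } else if (count > pq.peek().count) {
                // Competitive: evict the weakest bucket and reuse its object.
                Bucket b = pq.poll();
                b.ord = ord;
                b.count = count;
                pq.add(b);
            }
            // Otherwise the ordinal cannot compete and nothing is allocated.
        }
        return pq;
    }
}
```

With 1000 ordinals and a queue size of 10, this allocates 10 buckets instead of 1000. Ties on count are resolved in favour of the ordinal seen first, which is an arbitrary choice in this sketch.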
spinscale changed the title from "GlobalOrdinalsStringTermsAggregator is inefficient for high-cardinality fields" to "Aggregations: GlobalOrdinalsStringTermsAggregator is inefficient for high-cardinality fields" on Jun 18, 2014
clintongormley changed the title from "Aggregations: GlobalOrdinalsStringTermsAggregator is inefficient for high-cardinality fields" to "GlobalOrdinalsStringTermsAggregator is inefficient for high-cardinality fields" on Jun 7, 2015
After doing some profiling and seeing surprising results, I looked at the code (and I could easily be reading it wrong...).
buildAggregation() has the following loop:
This is very costly for high-cardinality fields, because it means we look up ord->term for every single ordinal. Instead, can we use a PriorityQueue of "OrdAndSortValue", find the top-N, and only at the end look up ord->term for the top-N? E.g. in Solr faceting this OrdAndSortValue is just a long (32 bits of ord and 32 bits of count), but the representation is less important here.
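A rough sketch of that proposal, to show the shape rather than the actual aggregator code: `TopOrdinals`, `pack`, and `topTerms` are hypothetical names, the per-ordinal counts are assumed to be available as an `int[]`, and `lookupTerm` stands in for whatever ord->term lookup the aggregator uses. Each entry is packed into a long with the count in the high 32 bits and the ord in the low 32 bits, so comparing longs compares counts first. The queue holds boxed `Long` values only for brevity; a real implementation would presumably use a primitive heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntFunction;

/** Hypothetical sketch: pick the top-N ordinals by count, resolving terms only for the winners. */
final class TopOrdinals {

    /** Count in the high 32 bits, ord in the low 32 bits (counts assumed non-negative). */
    static long pack(int ord, int count) {
        return ((long) count << 32) | (ord & 0xFFFFFFFFL);
    }

    static int ord(long packed) {
        return (int) packed;
    }

    static int count(long packed) {
        return (int) (packed >>> 32);
    }

    /** Returns the top-n terms by descending count; lookupTerm is called at most n times. */
    static List<String> topTerms(int[] countsByOrd, int n, IntFunction<String> lookupTerm) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<Long> pq = new PriorityQueue<>(n); // min-heap of packed values
        for (int ord = 0; ord < countsByOrd.length; ord++) {
            int count = countsByOrd[ord];
            if (count == 0) {
                continue; // assumption: empty buckets are not wanted
            }
            long packed = pack(ord, count);
            if (pq.size() < n) {
                pq.add(packed);
            } else if (packed > pq.peek()) {
                pq.poll();
                pq.add(packed);
            }
        }
        // The heap drains in ascending order, so fill the result from the back.
        String[] out = new String[pq.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = lookupTerm.apply(ord(pq.poll())); // the only ord->term lookups
        }
        return Arrays.asList(out);
    }
}
```

Because the ord sits in the low bits, ties on count are broken in favour of the higher ord here; that is a side effect of the packing, not a claim about how the aggregator orders ties.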