Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GlobalOrdinalsStringTermsAggregator is inefficient for high-cardinality fields #6518

Closed
rmuir opened this issue Jun 16, 2014 · 1 comment
Closed

Comments

@rmuir
Copy link
Contributor

rmuir commented Jun 16, 2014

After doing some profiling and seeing surprising results, and looking at the code, and I can easily be reading it wrong...

buildAggregation() has the following loop:

for (long globalTermOrd = Ordinals.MIN_ORDINAL; globalTermOrd < globalOrdinals.getMaxOrd(); ++globalTermOrd) {
                ...
                copy(globalValues.getValueByOrd(globalTermOrd), spare.termBytes);  
}

This is very costly for high-cardinality fields, because it means we lookup ord->term for every single one. Instead, can we use a PriorityQueue of "OrdAndSortValue", find the top-N, and then only at the end, lookup ord->term for the top-N? E.g. in solr faceting this OrdAndSortValue is just a long (32 bits ord and 32 bits count) but the representation is less important here.

@martijnvg martijnvg self-assigned this Jun 16, 2014
@rmuir
Copy link
Contributor Author

rmuir commented Jun 16, 2014

I experimented some more, this is only one perf problem. The other one is the massive amount of Bucket objects created (maxOrd).

We should do this more like a lucene collector: use the PQ more efficiently so we only create queue_size Bucket objects. With high cardinality fields, this method is just as much of a hotspot as collecting ordinals so we should optimize it as such: at least as a step avoid creating Bucket objects if it can't compete with the PQ, but maybe later try to optimize the loop with sentinels and stuff.

@martijnvg martijnvg assigned rmuir and unassigned martijnvg Jun 16, 2014
rmuir added a commit to rmuir/elasticsearch that referenced this issue Jun 16, 2014
@rmuir rmuir closed this as completed in 3892b6c Jun 16, 2014
@rmuir rmuir added bug labels Jun 16, 2014
rmuir added a commit that referenced this issue Jun 16, 2014
@spinscale spinscale changed the title GlobalOrdinalsStringTermsAggregator is inefficient for high-cardinality fields Aggregations: GlobalOrdinalsStringTermsAggregator is inefficient for high-cardinality fields Jun 18, 2014
@martijnvg martijnvg added enhancement and removed bug labels Jul 1, 2014
@clintongormley clintongormley changed the title Aggregations: GlobalOrdinalsStringTermsAggregator is inefficient for high-cardinality fields GlobalOrdinalsStringTermsAggregator is inefficient for high-cardinality fields Jun 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants