Improve global ordinals on low-cardinality fields #5854
Comments
The other approach, which is what Lucene does, is to detect low cardinality (with respect to the number of matching docs), collect with segment ords only, and then convert to global ords in a second pass. Imagine 1 billion docs with only 5 unique values: this saves a lot of CPU since you aren't remapping the same ordinals over and over.
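As a rough illustration of the two-pass idea described above (not Lucene's or Elasticsearch's actual code), the sketch below counts per segment ordinal first and remaps each distinct ordinal to its global ordinal only once per segment. The function names and the data layout (a list of per-document segment ordinals plus a segment-to-global lookup list) are assumptions made for this example.

```python
from collections import Counter


def count_one_pass(segments):
    """Baseline: remap every matching doc's segment ord to a global ord."""
    global_counts = Counter()
    for doc_ords, seg_to_global in segments:
        for seg_ord in doc_ords:
            # One lookup per document, even if the same ord repeats.
            global_counts[seg_to_global[seg_ord]] += 1
    return dict(global_counts)


def count_two_pass(segments):
    """Two passes per segment: count by segment ord, then remap once per ord.

    `segments` is a hypothetical layout: a list of (doc_ords, seg_to_global)
    pairs, where doc_ords holds the segment ordinal of each matching doc and
    seg_to_global[i] is the global ordinal of segment ordinal i.
    """
    global_counts = Counter()
    for doc_ords, seg_to_global in segments:
        # Pass 1: plain array increments, no remapping.
        seg_counts = [0] * len(seg_to_global)
        for seg_ord in doc_ords:
            seg_counts[seg_ord] += 1
        # Pass 2: one lookup per distinct segment ordinal.
        for seg_ord, count in enumerate(seg_counts):
            if count:
                global_counts[seg_to_global[seg_ord]] += count
    return dict(global_counts)
```

With 5 unique values and many matching docs, pass 2 performs at most 5 lookups per segment instead of one per document. On high-cardinality fields the extra pass over `seg_counts` can dominate, so this only pays off when the number of distinct ordinals is small relative to the number of matching docs.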
@jpountz Nice idea! The caveat here is that the largest segment does need to have all values, but for low-cardinality fields this should already be the case. @rmuir That implies that a feature using global ordinals (e.g. terms aggs) has to change its behaviour (a different execution mode). The nice thing about @jpountz's trick is that it sits behind the field data ordinals interface, so features using it wouldn't know about it, which makes it a small change.
@martijnvg Right, the two-pass approach requires a different execution mode, that's true, but it does not require any special alignment of ordinals, so it's a more general optimization. When I applied this to Apache Solr last year, I think it doubled faceting performance for low-cardinality fields: https://issues.apache.org/jira/browse/SOLR-5512
@rmuir When @martijnvg and I worked on global ordinals for Elasticsearch, we considered building aggregations with segment ordinals first and merging in a final step, but this introduced some complexity as well, since we would need to add logic to merge sub-aggregation buckets together (on every Aggregator impl). On the other hand, global ordinals are very appealing since we can use global ordinals directly as bucket ordinals. I think working with segment ordinals could be interesting for leaf terms aggregators though, as in this case we don't need to merge sub-aggregations.
When I benchmarked the change, it was as you would imagine: for high-cardinality fields it's slower to do two passes, too much overhead. The crazy heuristic to decide is something Mike came up with while benchmarking Lucene facets; I just stole it.
…ues for a field on a shard level. Relates to elastic#5854
+1 to explore and add this for leaf terms aggs
On low-cardinality fields, it is very likely that the large segments are going to contain the same set of values as the whole index. This means that the segment ordinals are already global and that the `segmentOrdToGlobalOrdLookup` is going to be an identity map. We could detect such situations and directly expose the segment ordinals as global ordinals in order to remove one layer of abstraction.
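A minimal sketch of how such detection might look, assuming each segment exposes its sorted list of unique terms. The names `build_global_ordinals` and `to_global` are hypothetical and do not correspond to Elasticsearch's field data API. Because a segment's terms are always a subset of the index-wide terms and both are sorted, a segment whose term count equals the global term count must have an identity mapping, so no lookup table is stored for it.

```python
def build_global_ordinals(segment_terms):
    """Build per-segment ord -> global ord lookups, skipping identity maps.

    `segment_terms` is a hypothetical input: one sorted list of unique
    terms per segment. Returns (global_terms, lookups), where lookups[i]
    is None when segment i's ordinals are already global.
    """
    global_terms = sorted(set().union(*segment_terms))
    global_ord = {term: i for i, term in enumerate(global_terms)}
    lookups = []
    for terms in segment_terms:
        if len(terms) == len(global_terms):
            # Same sorted term set as the whole index: identity mapping.
            lookups.append(None)
        else:
            lookups.append([global_ord[t] for t in terms])
    return global_terms, lookups


def to_global(lookup, seg_ord):
    """Resolve a segment ordinal, bypassing the lookup for identity maps."""
    return seg_ord if lookup is None else lookup[seg_ord]
```

For example, with segments `["a", "b", "c"]`, `["b"]` and `["a", "c"]`, only the first segment gets a `None` lookup, and its ordinals can be used directly as global ordinals.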