New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add global ordinals #5672
Add global ordinals #5672
Conversation
to improve the execution time. Search features can instead of using the actual | ||
terms, can use these unique numbers which improves the performance. | ||
|
||
Global ordinals are build based on the field data entries for each segment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*built
This looks really great! Maybe something that would deserve more testing is making sure that global ords get out of the cache when a top-level reader gets closed? |
|
||
Global ordinals can be beneficial in search features that use segment ordinals already | ||
such as the terms aggregator to improve the execution time. Often these search features | ||
need to need to merge the segment ordinal results to a cross segment terms result. With |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"need to need to"
@@ -812,6 +813,55 @@ public void awaitTermination() throws InterruptedException { | |||
}; | |||
} | |||
|
|||
@Override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed there is an issue upper in this file:
if (fieldDataType.getLoading() != Loading.EAGER) {
continue;
}
But now that there is also EAGER_GLOBAL_ORDINALS, I think the condition should be if (fieldDataType.getLoading() == Loading.LAZY)
since we need to warm up per-segment data before global ords?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right! that could lead to a weird bug...
Just did a second round, I think this is very close! |
LGTM! |
…lddata. Added a terms aggregation implementations that work on global ordinals, which is also the default. Closes #5672
…lddata. Added a terms aggregation implementations that work on global ordinals, which is also the default. Closes #5672
…lddata. Added a terms aggregation implementations that work on global ordinals, which is also the default. Closes #5672
I added the benchmark results for the Each row is the result of executing terms aggregator on a string field a 100 times. All the fields have Running
Running
Clearly for high cardinality fields using global ordinals is a big win. Comparing the last runs of both test runs, global ordinals is more than 3 times faster. On low cardinality fields there is no clear winner and the difference are small. The noise (jvm gc, hotspot) is intervening with the actual result. The memory overhead that global ordinals add to field data is several times smaller than the field data itself is taking. |
I'd add to that that global ordinals might seem wasteful memory-wise since field data reports higher memory usage on high-cardinality fields, but actually the aggregator uses significantly less transient memory since it doesn't need to load term bytes into memory anymore to compare terms across segments. In the end, if you sum up the amount of memory that is needed to store field data with the amount of memory that is needed to store/compute counts for a particular query, global ordinals very likely require less memory. |
++ to get a more realistice view of the mem footprint of just ordinals, you'll need to take snapshots of the agtor as well (it's not just field data) |
What is the guidance for when one should give global_ordinals vs. just ordinals hint.... if you don't know the field cardinality? |
@otisg Currently terms aggregation builds a cross segment bucket_id --> term lookup during the execution which is similar to what global ordinals is, but on a per request basis. This logic has now moved from terms aggs to fielddata, so that once global ordinals are built they can be reused for subsequent requests. Terms aggs can just directly use the global ordinals from fielddata as bucket_ids, which like @jpountz explained make terms aggs use way less transient memory than they did before. So one should use execution hint global ordinals all the time, since it is a win for both low and high cardinality fields, this is the reason why global_ordinals is also the default. |
Thanks for this! |
Global ordinals is a data-structure on top of field data, that maintains an incremental numbering for all the terms in field data in a lexicographic order. The performance of search features like terms aggregator can be improved using global ordinals.
This PR also adds a new execution mode
global_ordinals
to terms aggregation, that is used by default for non nested terms aggregations.