New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve terms aggregation to perform the segment ordinal to global ordinal lookup post segment collection #5895
Conversation
The initial commit adds an extra execution mode for terms aggregation ( |
Merged the |
// Ideally we want to know the amount of docs that are going to match... we don't know because the | ||
// the aggs are executed with the main query and even if we knew for nested aggs it is even harder | ||
// to know the the matching docs. | ||
double postGlobalOrdinalResolvingRatio = segmentOrdinals.getNumOrds() / segmentOrdinals.getNumDocs(); // maybe multiple numDocs with a factor? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think getNumOrds
needs to be casted to a double?
I'm concerned that this change conflicts a bit with #5873. For example, if you have a segment where ordinals are already global, #5873 would use them directly and this would be optimal. On the other hand, this change would collect them into a separate structure and merge it with the global counts when the collection of the segment is terminated. Can we not collect into a different structure when ordinals are already global? (not sure how to detect it cleanly) |
Moreover, I liked better when this execution mode was in its own class since it might have different runtime properties (especially memory usage)? |
// the aggs are executed with the main query and even if we knew for nested aggs it is even harder | ||
// to know the the matching docs. | ||
double postGlobalOrdinalResolvingRatio = segmentOrdinals.getNumOrds() / segmentOrdinals.getNumDocs(); // maybe multiple numDocs with a factor? | ||
if (postGlobalOrdinalResolvingRatio <= 0.9) { // TODO: make configurable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.9
looks high given that it is supposed to be used on low-cardinality fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also wondering if we should apply this strategy if any of the segments matches this criterion. This way we wouldn't need the f (segmentDocCounts != null)
condition in collect
anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that 0.9 is on the high side as well. I was playing around with this threshold and I did see a small improvement even for high cardinality fields, but for those fields the additional memory usage caused by segmentCounts
should also be taken into account, so we must set this to a lower value
Initially I put this enhancement into a different execution hint and I moved it into
So maybe we should be more conservative when using this post segment collection global ordinal resolving to be sure that update: We can in the
I think this can be detected. If the |
I tried to think more about when to use this execution mode:
So in the end, it looks to me like this new execution mode would be safe/useful on single-level terms aggregations? (which might still be quite common)
+1
I tend to dislike |
Use segment maxOrd and global maxOrd to detected if global ordinals lookup needs to be performed
Updated the PR to have a dedicated
I replaced that with |
mapSegmentCountsToGlobalCounts(); | ||
Releasables.close(segmentDocCounts); | ||
segmentDocCounts = null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be done in postCollect instead?
…Collect Compute maxOrd in factory if global ordinals is going to be used. Use maxOrd to pick GLOBAL_ORDINALS or GLOBAL_ORDINALS_LOW_CARDINALITY For GLOBAL_ORDINALS and GLOBAL_ORDINALS_LOW_CARDINALITY use maxOrd as bucket count
Thanks for reviewing this @jpountz! I Updated PR with the following changes:
|
@@ -159,6 +161,8 @@ public void setNeedsGlobalOrdinals(boolean needsGlobalOrdinals) {} | |||
|
|||
public abstract BytesValues.WithOrdinals globalBytesValues(); | |||
|
|||
public abstract long maxOrd(IndexSearcher indexSearcher); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be called globalMaxOrd
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should, I'll change that.
LGTM |
In case when there are not too many unique values it is better to do the segment ordinal to global ordinal lookup after segment results have been processed.