Add global ordinals #5672

Closed
wants to merge 2 commits into from

Conversation

martijnvg
Member

Global ordinals are a data structure on top of field data that maintains an incremental numbering for all the terms in field data, in lexicographic order. Search features such as the terms aggregator can use global ordinals to improve performance.

This PR also adds a new execution mode, global_ordinals, to the terms aggregation; it is used by default for non-nested terms aggregations.
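
For readers following along, here is a minimal sketch of the idea described above. It is not the data structure added by this PR: the class name `GlobalOrdinalsSketch` and its methods are hypothetical, terms are modelled as Java strings (whose natural order only approximates the byte order used for real terms), and each segment is assumed to expose a sorted, de-duplicated term list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Illustrative only: a hypothetical, simplified model of global ordinals. */
final class GlobalOrdinalsSketch {
    /** Sorted, de-duplicated union of all segment terms; list index == global ordinal. */
    final List<String> globalTerms;
    /** segmentToGlobal[s][segOrd] is the global ordinal of term segOrd in segment s. */
    final int[][] segmentToGlobal;

    /** Each inner list is assumed to be sorted and free of duplicates. */
    GlobalOrdinalsSketch(List<List<String>> segmentTerms) {
        TreeSet<String> union = new TreeSet<>();
        for (List<String> terms : segmentTerms) {
            union.addAll(terms);
        }
        globalTerms = new ArrayList<>(union);

        segmentToGlobal = new int[segmentTerms.size()][];
        for (int s = 0; s < segmentTerms.size(); s++) {
            List<String> terms = segmentTerms.get(s);
            int[] map = new int[terms.size()];
            int g = 0;
            for (int ord = 0; ord < terms.size(); ord++) {
                // Both lists are sorted, so the global cursor only moves forward.
                while (!globalTerms.get(g).equals(terms.get(ord))) {
                    g++;
                }
                map[ord] = g;
            }
            segmentToGlobal[s] = map;
        }
    }

    /** Translates a per-segment ordinal into its global ordinal. */
    int globalOrd(int segment, int segmentOrd) {
        return segmentToGlobal[segment][segmentOrd];
    }

    /** Resolves a global ordinal back to its term. */
    String term(int globalOrd) {
        return globalTerms.get(globalOrd);
    }
}
```

With segments `[apple, cherry]` and `[banana, cherry, date]`, both occurrences of `cherry` map to the same global ordinal, so per-segment results can be combined by number without comparing the terms themselves.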

to improve the execution time. Search features can instead of using the actual
terms, can use these unique numbers which improves the performance.

Global ordinals are build based on the field data entries for each segment
Contributor

*built

@jpountz
Contributor

jpountz commented Apr 3, 2014

This looks really great! Maybe something that would deserve more testing is making sure that global ords get out of the cache when a top-level reader gets closed?


Global ordinals can be beneficial in search features that use segment ordinals already
such as the terms aggregator to improve the execution time. Often these search features
need to need to merge the segment ordinal results to a cross segment terms result. With
Contributor

"need to need to"

@@ -812,6 +813,55 @@ public void awaitTermination() throws InterruptedException {
};
}

@Override
Contributor

Just noticed there is an issue further up in this file:

                    if (fieldDataType.getLoading() != Loading.EAGER) {
                        continue;
                    }

But now that there is also EAGER_GLOBAL_ORDINALS, I think the condition should be if (fieldDataType.getLoading() == Loading.LAZY) since we need to warm up per-segment data before global ords?

Member Author

right! that could lead to a weird bug...
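
To make the difference between the two conditions concrete, here is a small sketch. Only the constant names `LAZY`, `EAGER` and `EAGER_GLOBAL_ORDINALS` come from the comment above; the enum declaration, the class `WarmerLoadingSketch` and both helper methods are hypothetical stand-ins, not the actual warmer code.

```java
/** Illustrative only: compares the original and the suggested skip condition. */
final class WarmerLoadingSketch {
    // Constant names are from the review comment; the declaration itself is assumed.
    enum Loading { LAZY, EAGER, EAGER_GLOBAL_ORDINALS }

    /** Original check: per-segment warming runs only when loading is exactly EAGER. */
    static boolean warmsSegmentsOld(Loading loading) {
        return loading == Loading.EAGER;
    }

    /** Suggested check: only LAZY is skipped, so EAGER_GLOBAL_ORDINALS is warmed too. */
    static boolean warmsSegmentsNew(Loading loading) {
        return loading != Loading.LAZY;
    }
}
```

Under the original check a field configured with `EAGER_GLOBAL_ORDINALS` would be skipped by the per-segment warmer; under the suggested check it is not.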

@jpountz
Contributor

jpountz commented Apr 4, 2014

Just did a second round, I think this is very close!

@jpountz
Contributor

jpountz commented Apr 4, 2014

LGTM!

…lddata.

Added a terms aggregation implementations that work on global ordinals, which is also the default.

Closes #5672
martijnvg added a commit that referenced this pull request Apr 7, 2014
…lddata.

Added a terms aggregation implementations that work on global ordinals, which is also the default.

Closes #5672
@martijnvg martijnvg closed this in ade1d0e Apr 7, 2014
@martijnvg
Member Author

I added the benchmark results for the terms aggregator comparing running with execution hint global_ordinals and execution hint ordinals.

Each row is the result of executing the terms aggregator on a string field 100 times. All the fields have index set to not_analyzed and the index contains 5M docs. For the purpose of the benchmark, the index has just a single shard. The first column is the name of the field the terms aggregator was executed against; the field suffix gives the number of unique values in that field. The took column reports the total time it took to execute the terms aggregator 100 times, the millis column reports the average time in milliseconds to execute a single terms aggregation, and the fielddata size column tells how much memory the field data structures took for the field being aggregated.

Running terms aggregator with execution hint ordinals:

| Field name | took | millis | fielddata size |
| --- | --- | --- | --- |
| field_64 | 18.1s | 181 | 5.1mb |
| field_128 | 17.9s | 179 | 5.1mb |
| field_256 | 15.9s | 159 | 9.9mb |
| field_8192 | 23.7s | 237 | 12.2mb |
| field_32768 | 35.6s | 356 | 16.6mb |
| field_65536 | 59.4s | 594 | 23.9mb |
| field_131072 | 1.7m | 1041 | 34.3mb |
| field_524288 | 3.6m | 2197 | 46mb |
| field_1048576 | 4.3m | 2602 | 51.5mb |
| field_2097152 | 4.9m | 2991 | 55.8mb |

Running terms aggregator with execution hint global_ordinals:

| Field name | took | millis | fielddata size |
| --- | --- | --- | --- |
| field_64 | 16.5s | 165 | 5.1mb |
| field_128 | 16.4s | 164 | 5.1mb |
| field_256 | 16.3s | 163 | 9.9mb |
| field_8192 | 20.2s | 202 | 12.3mb |
| field_32768 | 21.4s | 214 | 17mb |
| field_65536 | 22.1s | 221 | 24.7mb |
| field_131072 | 23.5s | 235 | 35.8mb |
| field_524288 | 46.6s | 466 | 50.7mb |
| field_1048576 | 1.1m | 673 | 59mb |
| field_2097152 | 1.3m | 835 | 67.2mb |

Clearly, for high-cardinality fields using global ordinals is a big win. Comparing the last rows of both test runs (field_2097152), global ordinals is more than 3 times faster. On low-cardinality fields there is no clear winner and the differences are small; noise (JVM GC, HotSpot) is interfering with the actual result. The memory overhead that global ordinals add to field data is several times smaller than the field data itself is taking.
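
The arithmetic behind these statements can be checked with a short sketch. The numbers are the millis columns copied from the two tables above; the class name `SpeedupSketch` and its method are hypothetical helpers added only for illustration.

```java
/** Illustrative only: per-field speedup computed from the millis columns above. */
final class SpeedupSketch {
    static final String[] FIELDS = {
        "field_64", "field_128", "field_256", "field_8192", "field_32768",
        "field_65536", "field_131072", "field_524288", "field_1048576", "field_2097152"
    };
    // Average millis per request with execution hint ordinals.
    static final int[] ORDINALS_MS = {181, 179, 159, 237, 356, 594, 1041, 2197, 2602, 2991};
    // Average millis per request with execution hint global_ordinals.
    static final int[] GLOBAL_ORDINALS_MS = {165, 164, 163, 202, 214, 221, 235, 466, 673, 835};

    /** Ratio above 1.0 means global_ordinals was faster for the field at index i. */
    static double speedup(int i) {
        return (double) ORDINALS_MS[i] / GLOBAL_ORDINALS_MS[i];
    }
}
```

For field_2097152 the ratio is 2991 / 835, roughly 3.6, while for field_256 it is 159 / 163, slightly below 1, which matches the "no clear winner" observation for low-cardinality fields.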

@jpountz
Contributor

jpountz commented Apr 7, 2014

The memory overhead that global ordinals add to field data is several times smaller than the field data itself is taking.

I'd add to that that global ordinals might seem wasteful memory-wise since field data reports higher memory usage on high-cardinality fields, but actually the aggregator uses significantly less transient memory since it doesn't need to load term bytes into memory anymore to compare terms across segments. In the end, if you sum up the amount of memory that is needed to store field data with the amount of memory that is needed to store/compute counts for a particular query, global ordinals very likely require less memory.
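
A rough sketch of why counting by global ordinal needs little transient memory is shown below. It is a simplification under stated assumptions, not the aggregator from this PR: the class `OrdinalCountingSketch`, its methods, and the flat `int[]` of per-document global ordinals are all hypothetical.

```java
/** Illustrative only: counting per global ordinal without touching term bytes. */
final class OrdinalCountingSketch {
    /**
     * Counts documents per global ordinal. The only transient state is one int per
     * distinct term; no term bytes are loaded or compared while collecting.
     */
    static int[] count(int[] docGlobalOrds, int numGlobalOrds) {
        int[] counts = new int[numGlobalOrds];
        for (int ord : docGlobalOrds) {
            counts[ord]++;
        }
        return counts;
    }

    /** Returns the ordinal with the highest count; ties go to the lowest ordinal. */
    static int topOrdinal(int[] counts) {
        int best = 0;
        for (int ord = 1; ord < counts.length; ord++) {
            if (counts[ord] > counts[best]) {
                best = ord;
            }
        }
        return best;
    }
}
```

Terms only need to be resolved for the ordinals that end up in the final result, which is where the saving over comparing term bytes across segments would come from.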

@uboness
Contributor

uboness commented Apr 7, 2014

++ To get a more realistic view of the mem footprint of just ordinals, you'll need to take snapshots of the aggregator as well (it's not just field data).

@otisg

otisg commented Apr 9, 2014

What is the guidance for when one should give the global_ordinals hint vs. just the ordinals hint, if you don't know the field cardinality?

@martijnvg
Member Author

@otisg Currently the terms aggregation builds a cross-segment bucket_id --> term lookup during execution, which is similar to what global ordinals are, but on a per-request basis. This logic has now moved from terms aggs to fielddata, so that once global ordinals are built they can be reused for subsequent requests. Terms aggs can directly use the global ordinals from fielddata as bucket_ids, which, as @jpountz explained, makes terms aggs use far less transient memory than they did before. So one should use execution hint global_ordinals all the time, since it is a win for both low- and high-cardinality fields; this is the reason why global_ordinals is also the default.
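
The "build once, reuse for subsequent requests" behaviour can be sketched as a small cache. This is a hypothetical illustration, not the fielddata cache in this PR: the class `GlobalOrdinalsCacheSketch`, the generic reader key, and the `onReaderClosed` hook are assumptions, and the `builds` counter exists only to make the reuse visible.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Illustrative only: caches an expensive per-reader structure across requests. */
final class GlobalOrdinalsCacheSketch<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> builder;
    /** Number of times the expensive build actually ran. */
    final AtomicInteger builds = new AtomicInteger();

    GlobalOrdinalsCacheSketch(Function<K, V> builder) {
        this.builder = builder;
    }

    /** Builds on first use for a given top-level reader key, then reuses the result. */
    V get(K readerKey) {
        return cache.computeIfAbsent(readerKey, key -> {
            builds.incrementAndGet();
            return builder.apply(key);
        });
    }

    /** Drops the entry when the top-level reader is closed, so it can be collected. */
    void onReaderClosed(K readerKey) {
        cache.remove(readerKey);
    }
}
```

Two requests against the same reader key trigger one build; after `onReaderClosed` the next request builds again, which is also the eviction behaviour raised earlier in the review.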

@martijnvg martijnvg deleted the feature/global-ordinals branch April 10, 2014 17:10
@tlrx
Member

tlrx commented Apr 14, 2014

Thanks for this!

@clintongormley clintongormley added :Fielddata and removed :Cache :Core/Infra/Core Core issues without another label :Search/Mapping Index mappings, including merging and defining field types :Search/Search Search-related issues that do not fall into other categories labels Jun 7, 2015
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Fielddata labels Feb 14, 2018
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v1.2.0 v2.0.0-beta1

7 participants