Add global ordinals #5672

martijnvg · 2014-04-03T08:50:47Z

Global ordinals are a data structure on top of field data that maintains an incremental numbering for all the terms in field data, in lexicographic order. Search features such as the terms aggregator can use global ordinals to improve performance.
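To illustrate the numbering described above, here is a minimal sketch, not the code from this PR. It assumes each segment exposes its terms as a sorted list, where the index of a term is its segment ordinal. The class and method names (`GlobalOrdinalsSketch`, `globalTerms`, `segmentToGlobal`) are hypothetical, and Java's `String` ordering stands in for the term byte ordering used by field data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

/** Illustrative only: names are hypothetical and are not Elasticsearch APIs. */
public class GlobalOrdinalsSketch {

    /** Sorted, de-duplicated union of all segment terms; a term's index in this list is its global ordinal. */
    public static List<String> globalTerms(List<List<String>> segments) {
        TreeSet<String> union = new TreeSet<>();
        for (List<String> segmentTerms : segments) {
            union.addAll(segmentTerms);
        }
        return new ArrayList<>(union);
    }

    /** For one segment, maps each segment ordinal to the global ordinal of the same term. */
    public static int[] segmentToGlobal(List<String> segmentTerms, List<String> globalTerms) {
        int[] mapping = new int[segmentTerms.size()];
        for (int segOrd = 0; segOrd < segmentTerms.size(); segOrd++) {
            // Every segment term is present in the global list, so the search always succeeds.
            mapping[segOrd] = Collections.binarySearch(globalTerms, segmentTerms.get(segOrd));
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("apple", "cherry"),
                Arrays.asList("banana", "cherry", "date"));
        List<String> global = globalTerms(segments);
        System.out.println(global); // [apple, banana, cherry, date]
        for (List<String> segmentTerms : segments) {
            // Prints [0, 2] for the first segment and [1, 2, 3] for the second.
            System.out.println(Arrays.toString(segmentToGlobal(segmentTerms, global)));
        }
    }
}
```

Note that "cherry" has segment ordinal 1 in both segments but a single global ordinal, 2, which is what lets per-segment results be combined without comparing term bytes.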

This PR also adds a new execution mode, global_ordinals, to the terms aggregation. It is used by default for non-nested terms aggregations.

jpountz · 2014-04-03T08:52:22Z

docs/reference/index-modules/fielddata.asciidoc

+to improve the execution time. Search features can instead of using the actual
+terms, can use these unique numbers which improves the performance.
+
+Global ordinals are build based on the field data entries for each segment


jpountz · 2014-04-03T11:30:51Z

This looks really great! Maybe something that would deserve more testing is making sure that global ords get out of the cache when a top-level reader gets closed?

jpountz · 2014-04-04T12:17:40Z

docs/reference/index-modules/fielddata.asciidoc

+
+Global ordinals can be beneficial in search features that use segment ordinals already
+such as the terms aggregator to improve the execution time. Often these search features
+need to need to merge the segment ordinal results to a cross segment terms result. With


"need to need to"

jpountz · 2014-04-04T13:26:41Z

src/main/java/org/elasticsearch/search/SearchService.java

@@ -812,6 +813,55 @@ public void awaitTermination() throws InterruptedException {
            };
        }

+        @Override


Just noticed there is an issue further up in this file:

if (fieldDataType.getLoading() != Loading.EAGER) { continue; }

But now that there is also EAGER_GLOBAL_ORDINALS, I think the condition should be if (fieldDataType.getLoading() == Loading.LAZY) since we need to warm up per-segment data before global ords?

right! that could lead to a weird bug...
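The difference between the two conditions can be checked with a small standalone sketch. The `Loading` enum below is a hypothetical stand-in that only mirrors the three mode names mentioned in this thread; it is not the actual Elasticsearch type, and the method names are made up for illustration.

```java
/** Illustrative only: a stand-in enum and two predicates mirroring the conditions discussed above. */
public class WarmerLoadingSketch {

    /** Hypothetical copy of the loading modes named in the review comment. */
    public enum Loading { LAZY, EAGER, EAGER_GLOBAL_ORDINALS }

    /** Existing condition: skip warming unless the mode is exactly EAGER. */
    public static boolean skipsWarmingOld(Loading loading) {
        return loading != Loading.EAGER;
    }

    /** Suggested condition: skip warming only when the mode is LAZY. */
    public static boolean skipsWarmingNew(Loading loading) {
        return loading == Loading.LAZY;
    }

    public static void main(String[] args) {
        for (Loading loading : Loading.values()) {
            System.out.println(loading
                    + " old skips=" + skipsWarmingOld(loading)
                    + " new skips=" + skipsWarmingNew(loading));
        }
        // The two conditions differ only for EAGER_GLOBAL_ORDINALS:
        // the old one skips per-segment warming for it, the suggested one does not.
    }
}
```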

jpountz · 2014-04-04T13:38:11Z

Just did a second round, I think this is very close!

jpountz · 2014-04-04T17:48:36Z

LGTM!


martijnvg · 2014-04-07T14:33:46Z

I added the benchmark results for the terms aggregator, comparing runs with execution hint global_ordinals and execution hint ordinals.

Each row is the result of executing the terms aggregator on a string field 100 times. All the fields have index set to not_analyzed, and the index contains 5M docs. For the purpose of the benchmark, the index has just a single shard. The first column is the name of the field the terms aggregator was executed against; the field suffix gives the number of unique values in that field. The took column reports the total time it took to execute the terms aggregator 100 times, the millis column reports the average time in milliseconds to execute a single terms aggregation, and the last column reports how much memory the field data structures took for the aggregated field.

Running terms aggregator with execution hint ordinals:

Field name	took	millis	fielddata size
field_64	18.1s	181	5.1mb
field_128	17.9s	179	5.1mb
field_256	15.9s	159	9.9mb
field_8192	23.7s	237	12.2mb
field_32768	35.6s	356	16.6mb
field_65536	59.4s	594	23.9mb
field_131072	1.7m	1041	34.3mb
field_524288	3.6m	2197	46mb
field_1048576	4.3m	2602	51.5mb
field_2097152	4.9m	2991	55.8mb

Running terms aggregator with execution hint global_ordinals:

Field name	took	millis	fielddata size
field_64	16.5s	165	5.1mb
field_128	16.4s	164	5.1mb
field_256	16.3s	163	9.9mb
field_8192	20.2s	202	12.3mb
field_32768	21.4s	214	17mb
field_65536	22.1s	221	24.7mb
field_131072	23.5s	235	35.8mb
field_524288	46.6s	466	50.7mb
field_1048576	1.1m	673	59mb
field_2097152	1.3m	835	67.2mb

Clearly, for high-cardinality fields, using global ordinals is a big win. Comparing the last rows of both test runs (field_2097152), global ordinals is more than 3 times faster. On low-cardinality fields there is no clear winner and the differences are small; noise (JVM GC, HotSpot) interferes with the actual result. The memory overhead that global ordinals add to field data is several times smaller than the field data itself is taking.
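The ratios behind these statements can be recomputed from the two tables. The sketch below copies the millis and size columns as posted and derives the per-field speedup and the extra field data memory; the class and method names are made up for illustration. For the last row this gives 2991 / 835, roughly 3.58 times faster, and 67.2mb - 55.8mb = 11.4mb of overhead against 55.8mb of field data.

```java
import java.util.Locale;

/** Illustrative only: recomputes ratios from the benchmark tables in this comment. */
public class BenchmarkSpeedupSketch {

    /** How many times faster global_ordinals is than ordinals for one field. */
    public static double speedup(double ordinalsMillis, double globalOrdinalsMillis) {
        return ordinalsMillis / globalOrdinalsMillis;
    }

    /** Extra field data memory (in mb) reported when global ordinals are loaded. */
    public static double overheadMb(double ordinalsMb, double globalOrdinalsMb) {
        return globalOrdinalsMb - ordinalsMb;
    }

    public static void main(String[] args) {
        String[] fields = {"field_64", "field_128", "field_256", "field_8192", "field_32768",
                "field_65536", "field_131072", "field_524288", "field_1048576", "field_2097152"};
        // Average millis per request, copied from the two tables.
        double[] ordinalsMillis = {181, 179, 159, 237, 356, 594, 1041, 2197, 2602, 2991};
        double[] globalMillis = {165, 164, 163, 202, 214, 221, 235, 466, 673, 835};
        // Field data size in mb, copied from the two tables.
        double[] ordinalsMb = {5.1, 5.1, 9.9, 12.2, 16.6, 23.9, 34.3, 46, 51.5, 55.8};
        double[] globalMb = {5.1, 5.1, 9.9, 12.3, 17, 24.7, 35.8, 50.7, 59, 67.2};

        for (int i = 0; i < fields.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%-14s speedup=%.2fx overhead=%.1fmb",
                    fields[i],
                    speedup(ordinalsMillis[i], globalMillis[i]),
                    overheadMb(ordinalsMb[i], globalMb[i])));
        }
    }
}
```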

jpountz · 2014-04-07T14:53:46Z

The memory overhead that global ordinals add to field data is several times smaller than the field data itself is taking.

I'd add to that that global ordinals might seem wasteful memory-wise since field data reports higher memory usage on high-cardinality fields, but actually the aggregator uses significantly less transient memory since it doesn't need to load term bytes into memory anymore to compare terms across segments. In the end, if you sum up the amount of memory that is needed to store field data with the amount of memory that is needed to store/compute counts for a particular query, global ordinals very likely require less memory.

uboness · 2014-04-07T18:53:55Z

++ To get a more realistic view of the memory footprint of just ordinals, you'll need to take snapshots of the aggregator as well (it's not just field data).

otisg · 2014-04-09T00:24:47Z

What is the guidance for when one should give the global_ordinals hint vs. just the ordinals hint, if you don't know the field cardinality?

martijnvg · 2014-04-09T04:21:38Z

@otisg Currently, terms aggregation builds a cross-segment bucket_id --> term lookup during execution, which is similar to what global ordinals are, but on a per-request basis. This logic has now moved from terms aggs to fielddata, so that once global ordinals are built they can be reused for subsequent requests. Terms aggs can directly use the global ordinals from fielddata as bucket_ids, which, as @jpountz explained, makes terms aggs use far less transient memory than they did before. So one should use the global_ordinals execution hint all the time, since it is a win for both low- and high-cardinality fields. This is why global_ordinals is also the default.
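As a rough illustration of using global ordinals directly as bucket_ids, the sketch below counts documents per term in a plain array indexed by global ordinal. It is a simplified model, not the aggregator from this PR: it assumes single-valued documents, a precomputed segment-ordinal to global-ordinal mapping per segment, and hypothetical names throughout.

```java
import java.util.Arrays;

/** Illustrative only: counts documents per term using global ordinals as bucket ids. */
public class GlobalOrdinalBucketsSketch {

    /**
     * @param segmentDocOrds  per segment, the segment ordinal of each document's single value
     * @param segmentToGlobal per segment, the mapping from segment ordinal to global ordinal
     * @param globalTermCount number of distinct terms across all segments
     * @return document count per global ordinal
     */
    public static long[] countByGlobalOrdinal(int[][] segmentDocOrds, int[][] segmentToGlobal, int globalTermCount) {
        long[] counts = new long[globalTermCount];
        for (int seg = 0; seg < segmentDocOrds.length; seg++) {
            for (int segOrd : segmentDocOrds[seg]) {
                // No term bytes are read or compared here, only integers.
                counts[segmentToGlobal[seg][segOrd]]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] globalTerms = {"apple", "banana", "cherry", "date"};
        // Segment 0 holds terms [apple, cherry]; segment 1 holds [banana, cherry, date].
        int[][] segmentToGlobal = {{0, 2}, {1, 2, 3}};
        // Segment 0 docs: apple, cherry, cherry. Segment 1 docs: banana, cherry, date, date.
        int[][] segmentDocOrds = {{0, 1, 1}, {0, 1, 2, 2}};

        long[] counts = countByGlobalOrdinal(segmentDocOrds, segmentToGlobal, globalTerms.length);
        System.out.println(Arrays.toString(counts)); // [1, 1, 3, 2]
        // Terms are looked up only when building the response, once per returned bucket.
        for (int ord = 0; ord < counts.length; ord++) {
            System.out.println(globalTerms[ord] + " -> " + counts[ord]);
        }
    }
}
```

In this model the only per-request state is the counts array, which is one way to picture the lower transient memory use mentioned above.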

tlrx · 2014-04-14T06:49:41Z

Thanks for this!

jpountz reviewed Apr 3, 2014

martijnvg added v1.2.0 and removed Lucene 4.5 Upgrade labels Apr 4, 2014

jpountz reviewed Apr 4, 2014

Added global ordinals (unique incremental numbering for terms) to fie…

4f82ba3

…lddata. Added a terms aggregation implementations that work on global ordinals, which is also the default. Closes #5672

Added global ordinals (unique incremental numbering for terms) to fie…

4727c54

…lddata. Added a terms aggregation implementations that work on global ordinals, which is also the default. Closes #5672

martijnvg added a commit that referenced this pull request Apr 7, 2014

Added global ordinals (unique incremental numbering for terms) to fie…

c000de5

…lddata. Added a terms aggregation implementations that work on global ordinals, which is also the default. Closes #5672

martijnvg closed this in ade1d0e Apr 7, 2014

martijnvg deleted the feature/global-ordinals branch April 10, 2014 17:10

jpountz mentioned this pull request May 9, 2014

Add doc values support to _parent field data #6107

Closed

clintongormley added :Fielddata and removed :Cache, :Core/Infra/Core, :Search/Mapping, :Search/Search labels Jun 7, 2015

clintongormley added :Search/Search and removed :Fielddata labels Feb 14, 2018
