Speed up `exists` and `missing` filters on high-cardinality fields #5659

jpountz · 2014-04-02T11:13:48Z

The `exists` filter works by merging the postings lists of all terms in the field. `missing` just wraps an `exists` filter in a `not` filter.
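To make the cost concrete, here is a minimal sketch of the merge-based approach over a toy inverted index. The data structure and the function names (`exists_by_merging`, `missing`) are hypothetical and only illustrate the idea; this is not Lucene or Elasticsearch code.

```python
from typing import Dict, List, Set

# Toy inverted index for ONE field: term -> sorted list of doc IDs (a postings list).
# All names here are hypothetical, chosen for illustration only.
Postings = Dict[str, List[int]]


def exists_by_merging(postings: Postings) -> List[int]:
    """Return the docs that have any value for the field by merging every postings list."""
    matched: Set[int] = set()
    for doc_ids in postings.values():  # one pass per distinct term in the field
        matched.update(doc_ids)
    return sorted(matched)


def missing(postings: Postings, max_doc: int) -> List[int]:
    """`missing` is the negation of `exists` over all doc IDs in [0, max_doc)."""
    present = set(exists_by_merging(postings))
    return [doc for doc in range(max_doc) if doc not in present]


# A high-cardinality field has roughly one term per document,
# so the loop above visits about as many postings lists as there are documents.
user_id_postings = {"u1": [0], "u2": [2], "u3": [3]}
print(exists_by_merging(user_id_postings))   # [0, 2, 3]
print(missing(user_id_postings, max_doc=5))  # [1, 4]
```

The work grows with the number of distinct terms, which is why high-cardinality fields are the bad case.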

Merging all postings lists can however be very slow on high-cardinality fields. I think there are two ways to fix it:

- make these filters run on top of field data,
- or add a new metadata field, which we could e.g. call `_field_names`, that would index all field names of a document.

Working on field data has the drawback of requiring a lot of data to be loaded into memory if the field doesn't have doc values, and the returned filter cannot skip.

I tend to like indexing field names because it would not load anything into memory with a default setup, and the returned filter could skip efficiently since it would be based on a postings list. But unfortunately it could not be used on indices that have been created before we introduce this new metadata field.
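A rough sketch of the `_field_names` idea, again over a toy index: each document contributes its field names as terms, so `exists` becomes a lookup of a single postings list, and a sorted postings list can skip ahead to a target doc ID. The helper names (`index_field_names`, `exists`, `advance`) are hypothetical, and treating `None` values as absent is an assumption of this sketch, not something stated above.

```python
from bisect import bisect_left
from typing import Any, Dict, List


def index_field_names(docs: List[Dict[str, Any]]) -> Dict[str, List[int]]:
    """Build postings for a hypothetical _field_names field: field name -> sorted doc IDs."""
    postings: Dict[str, List[int]] = {}
    for doc_id, doc in enumerate(docs):
        for name, value in doc.items():
            if value is not None:  # assumption: a null value counts as "no value"
                postings.setdefault(name, []).append(doc_id)
    return postings


def exists(field_names: Dict[str, List[int]], field: str) -> List[int]:
    """`exists` reads one postings list instead of merging one list per term."""
    return field_names.get(field, [])


def advance(doc_ids: List[int], target: int) -> int:
    """Skip to the first doc ID >= target, or return -1 if there is none."""
    i = bisect_left(doc_ids, target)
    return doc_ids[i] if i < len(doc_ids) else -1


docs = [{"user": "a", "tag": "x"}, {"tag": "y"}, {"user": "b"}, {}]
field_names = index_field_names(docs)
print(exists(field_names, "user"))               # [0, 2]
print(advance(exists(field_names, "user"), 1))   # 2
```

Nothing has to be loaded into memory up front here, and the cost of `exists` no longer depends on how many distinct values the field has.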

s1monw · 2014-04-02T11:29:28Z

I really like the _field_names approach!

uboness · 2014-04-02T11:30:24Z

+1 on the _field_names approach... I think we'll find them to be useful for other things as well

clintongormley · 2014-04-02T11:54:34Z

+1 on _field_names - awesome solution

dadoonet · 2014-04-02T12:28:08Z

So it's like `_all` but for field names?
In the future, might we need separate sets of field names, and a "copy_fieldname_to" feature to populate them?

Just thinking out loud. Maybe there is no use case for that...

clintongormley · 2014-05-13T10:04:32Z

I don't want this to be forgotten, so I've added a v1.3.0 label. No pressure ;)

The `exists` and `missing` filters need to merge postings lists of all existing terms, which can be very costly, especially on high-cardinality fields. This commit indexes the field names of a document under `_field_names` and reuses it to speed up the `exists` and `missing` filters. This is only enabled for indices that are created on or after Elasticsearch 1.3.0. Close elastic#5659

jpountz · 2014-05-23T09:17:57Z

@spinscale asked me about the disk footprint of this feature. In general it is very low: `_field_names` is indexed with DOCS_ONLY index options and only records fields that are indexed or have doc values. Additionally, fields that are contained in most documents will have dense postings lists that typically compress very well.

I did some experiments in 2 extreme cases:

- 5 user-defined fields in addition to the metadata fields (_uid, etc.) that are contained in every document: overhead of ~0.5 bytes per document.
- 100 user-defined fields in the mapping and each document has 5 random fields from these 100 fields: overhead of ~4.7 bytes per document.

This looks very reasonable to me. Even the second case, which has very sparse documents, takes less than one byte per field per document.
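The per-field figure follows directly from the two measurements above. A small sketch of the arithmetic, with a hypothetical helper name (`bytes_per_field`); for the dense case only the 5 user-defined fields are counted, so the real per-field figure is lower once the metadata fields are included:

```python
def bytes_per_field(overhead_bytes_per_doc: float, fields_per_doc: int) -> float:
    """Average _field_names overhead per field per document."""
    return overhead_bytes_per_doc / fields_per_doc


# Sparse case: 5 random fields out of 100, ~4.7 bytes per document.
sparse = bytes_per_field(4.7, 5)
# Dense case: ~0.5 bytes per document, counting only the 5 user-defined fields (upper bound).
dense_upper_bound = bytes_per_field(0.5, 5)

print(round(sparse, 2))             # 0.94
print(round(dense_upper_bound, 2))  # 0.1
```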

clintongormley added the v1.3.0 label May 13, 2014

jpountz added enhancement labels May 13, 2014

jpountz self-assigned this May 21, 2014

jpountz mentioned this issue May 21, 2014

Index field names of documents. #6269

Closed

jpountz closed this as completed in 703dbff Jun 19, 2014

jpountz added the highlight label Jun 19, 2014

jpountz changed the title ~~exists and missing filters are slow on high-cardinality fields~~ Queries: exists and missing filters are slow on high-cardinality fields Jun 19, 2014

brwe mentioned this issue Jul 1, 2014

Infrastructure for changing easily the significance terms heuristic #6561

Closed

clintongormley changed the title ~~Queries: exists and missing filters are slow on high-cardinality fields~~ Search: Speed up exists and missing filters on high-cardinality fields Jul 16, 2014

jpountz mentioned this issue Aug 20, 2014

Search result changed since 1.24 (current 1.3.2) #7348

Closed

hxuanji mentioned this issue Sep 1, 2014

Finding documents with empty string as value #7515

Closed

clintongormley added the :Search/Search label Jun 7, 2015

clintongormley changed the title ~~Search: Speed up exists and missing filters on high-cardinality fields~~ Speed up exists and missing filters on high-cardinality fields Jun 7, 2015
