Speed up exists and missing filters on high-cardinality fields #5659

Closed
jpountz opened this issue Apr 2, 2014 · 6 comments
Labels: >enhancement, release highlight, :Search/Search, v1.3.0, v2.0.0-beta1

Comments

@jpountz
Contributor

jpountz commented Apr 2, 2014

The `exists` filter works by merging the postings lists of all terms in the field. The `missing` filter just wraps an `exists` filter in a `not` filter.
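
To illustrate the cost described above, here is a minimal sketch in Python of a toy inverted index. It is not Elasticsearch or Lucene code: `build_postings`, `exists_filter`, and `missing_filter` are hypothetical names, and plain dicts and lists stand in for real postings lists. The point is only that `exists` has to visit one postings list per distinct term.

```python
from collections import defaultdict


def build_postings(docs, field):
    """Map each term of `field` to the list of doc IDs that contain it."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        value = doc.get(field)
        if value is not None:
            postings[value].append(doc_id)
    return postings


def exists_filter(postings):
    """Union every term's postings list; work grows with the number of distinct terms."""
    matching = set()
    for doc_ids in postings.values():
        matching.update(doc_ids)
    return sorted(matching)


def missing_filter(postings, num_docs):
    """`missing` is `not exists`: every doc ID absent from the exists result."""
    present = set(exists_filter(postings))
    return [doc_id for doc_id in range(num_docs) if doc_id not in present]


docs = [
    {"user": "alice"},
    {"user": "bob"},
    {"title": "no user here"},
    {"user": "carol"},
]
postings = build_postings(docs, "user")
print(exists_filter(postings))              # [0, 1, 3]
print(missing_filter(postings, len(docs)))  # [2]
```

In this toy example the `user` field has three distinct terms, so three lists are merged. On a high-cardinality field the number of lists to merge approaches the number of documents.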

However, merging all postings lists can be very slow on high-cardinality fields. I think there are two ways to fix it:

  1. make these filters run on top of field data,
  2. or add a new metadata field, which we could call e.g. `_field_names`, that would index all field names of a document.

Working on field data has the drawback of requiring a lot of data to be loaded into memory if the field doesn't have doc values, and the returned filter cannot skip.

I tend to like indexing field names because it would not load anything into memory with a default setup, and the returned filter could skip efficiently since it would be based on a postings list. Unfortunately, it could not be used on indices that were created before this new metadata field is introduced.
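
The field-names idea can be sketched the same way. This is an illustration under assumptions, not the actual implementation: `index_field_names`, `exists_filter`, and `advance` are hypothetical helpers, and a sorted Python list stands in for a postings list. Each document contributes its field names as terms, so `exists` becomes a single term lookup, and the sorted doc IDs allow skipping ahead.

```python
from bisect import bisect_left
from collections import defaultdict


def index_field_names(docs):
    """Index each document's field names as terms of a `_field_names`-like field."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for name in doc:
            # Doc IDs are appended in increasing order, so each list stays sorted.
            postings[name].append(doc_id)
    return postings


def exists_filter(field_names, field):
    """A single postings-list lookup, independent of the field's cardinality."""
    return field_names.get(field, [])


def advance(doc_ids, target):
    """Skip to the first doc ID >= target, or None if the list is exhausted."""
    i = bisect_left(doc_ids, target)
    return doc_ids[i] if i < len(doc_ids) else None


docs = [
    {"user": "alice", "age": 30},
    {"user": "bob"},
    {"title": "x"},
    {"user": "carol", "title": "y"},
]
field_names = index_field_names(docs)
user_docs = exists_filter(field_names, "user")
print(user_docs)              # [0, 1, 3]
print(advance(user_docs, 2))  # 3
```

The binary search in `advance` is only a stand-in for the skipping that a postings-based filter can offer; a scan over per-document field data has no equivalent shortcut.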

@s1monw
Contributor

s1monw commented Apr 2, 2014

I really like the _field_names approach!

@uboness
Contributor

uboness commented Apr 2, 2014

+1 on the _field_names approach... I think we'll find them to be useful for other things as well

@clintongormley

+1 on _field_names - awesome solution

@dadoonet
Member

dadoonet commented Apr 2, 2014

So it's like _all, but for field names?
In the future, might we need separate sets of field names, and so a "copy_fieldname_to" feature?

Just thinking out loud. Maybe there is no use case for that...

@clintongormley

I don't want this to be forgotten, so I've added a v1.3.0 label. No pressure ;)

@jpountz jpountz self-assigned this May 21, 2014
jpountz added a commit to jpountz/elasticsearch that referenced this issue May 21, 2014
The `exists` and `missing` filters need to merge postings lists of all existing
terms, which can be very costly, especially on high-cardinality fields. This
commit indexes the field names of a document under `_field_names` and reuses it
to speed up the `exists` and `missing` filters.

This is only enabled for indices that are created on or after Elasticsearch
1.3.0.

Close elastic#5659

@jpountz
Contributor Author

jpountz commented May 23, 2014

@spinscale asked me about the disk footprint of this feature. In general it is very low: its index options are `DOCS_ONLY`, and it is only enabled for fields that are indexed or have doc values. Additionally, fields that are contained in most documents will have dense postings lists that typically compress very well.

I did some experiments in 2 extreme cases:

  • 5 user-defined fields in addition to the metadata fields (_uid, etc.) that are contained in every document: overhead of ~0.5 bytes per document.
  • 100 user-defined fields in the mapping and each document has 5 random fields from these 100 fields: overhead of ~4.7 bytes per document.

This looks very reasonable to me. Even the 2nd case, which has very sparse documents, takes less than one byte per field per document.
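
The per-field figure follows from dividing the measured per-document overhead by the number of user-defined fields per document. The short sketch below works through that arithmetic in Python. The two measurements come from the experiments above; `NUM_DOCS` is a hypothetical index size chosen only to show the scale, and `overhead` is an illustrative helper. Metadata fields are ignored in the division, so the per-field numbers are upper bounds.

```python
NUM_DOCS = 100_000_000  # hypothetical index size, not from the experiments


def overhead(bytes_per_doc, user_fields_per_doc, num_docs=NUM_DOCS):
    """Return (bytes per field per document, total overhead in MB)."""
    per_field = bytes_per_doc / user_fields_per_doc
    total_mb = bytes_per_doc * num_docs / 1e6
    return per_field, total_mb


# (measured bytes per document, user-defined fields per document)
cases = {
    "dense: 5 fields in every document": (0.5, 5),
    "sparse: 5 random fields out of 100": (4.7, 5),
}

for label, (bytes_per_doc, fields_per_doc) in cases.items():
    per_field, total_mb = overhead(bytes_per_doc, fields_per_doc)
    print(f"{label}: {per_field:.2f} bytes/field/doc, ~{total_mb:.0f} MB total")
```

Under these assumptions the sparse case works out to about 0.94 bytes per field per document, which matches the "less than one byte" figure.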

jpountz added a commit that referenced this issue Jun 19, 2014
@jpountz jpountz changed the title from "exists and missing filters are slow on high-cardinality fields" to "Queries: exists and missing filters are slow on high-cardinality fields" Jun 19, 2014
@clintongormley clintongormley changed the title from "Queries: exists and missing filters are slow on high-cardinality fields" to "Search: Speed up exists and missing filters on high-cardinality fields" Jul 16, 2014
@clintongormley clintongormley added the :Search/Search label Jun 7, 2015
@clintongormley clintongormley changed the title from "Search: Speed up exists and missing filters on high-cardinality fields" to "Speed up exists and missing filters on high-cardinality fields" Jun 7, 2015