Disk-based field data #3806
ghost assigned jpountz on Sep 30, 2013
jpountz added a commit to jpountz/elasticsearch that referenced this issue on Oct 4, 2013
This commit allows for using Lucene doc values as a backend for field data, moving the cost of building field data from the refresh operation to indexing. In addition, Lucene doc values can be stored on disk (partially, or even entirely), so that memory management is done at the operating system level (file-system cache) instead of the JVM, avoiding long pauses during major collections due to large heaps.

So far doc values are supported on numeric types and non-analyzed strings (index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values which is the only type to support multi-valued fields.

Since the field data API set is a bit wider than the doc values API set, some operations are not supported:
- field data filtering: this will fail if doc values are enabled,
- field data cache clearing, even for memory-based doc values formats,
- getting the memory usage for a specific field,
- knowing whether a field is actually multi-valued.

This commit also allows for configuring doc-values formats on a per-field basis similarly to postings formats. In particular the doc values format of the _version field can be configured through its own field mapper (it used to be handled in UidFieldMapper previously).

Closes elastic#3806
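To make the SORTED_SET idea more concrete, here is a minimal Python sketch of how a multi-valued field can be encoded as a sorted term dictionary plus per-document ordinals. It is a toy model only, not Lucene's implementation or on-disk format; the function names and the sample segment are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def build_sorted_set_doc_values(
    docs: Sequence[Sequence[str]],
) -> Tuple[List[str], List[List[int]]]:
    """Return (sorted unique terms, sorted ordinals for each document)."""
    # Term dictionary: every distinct value in the segment, in sorted order.
    terms = sorted({value for doc in docs for value in doc})
    ord_of: Dict[str, int] = {term: i for i, term in enumerate(terms)}
    # Each document stores a sorted, de-duplicated list of ordinals,
    # so a document may carry zero, one, or many values.
    doc_ords = [sorted({ord_of[value] for value in doc}) for doc in docs]
    return terms, doc_ords


def values_for_doc(
    terms: Sequence[str], doc_ords: Sequence[Sequence[int]], doc_id: int
) -> List[str]:
    """Resolve a document's ordinals back to its values."""
    return [terms[ordinal] for ordinal in doc_ords[doc_id]]


# Hypothetical segment: three documents for one non-analyzed string field.
SEGMENT_DOCS = [["red", "blue"], [], ["green", "red", "red"]]
TERMS, DOC_ORDS = build_sorted_set_doc_values(SEGMENT_DOCS)
```

Because the encoding is computed once per segment from the indexed values, the work happens at indexing time rather than at refresh, which is the cost shift the commit message describes.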
talevy added a commit to talevy/elasticsearch that referenced this issue on Apr 25, 2018
talevy added a commit to talevy/elasticsearch that referenced this issue on May 14, 2018
jasontedor pushed a commit that referenced this issue on Aug 17, 2018
Lucene 4.0 introduced doc values, which are very similar to field data except that they are computed and persisted to disk (per segment) at index time. At search time, these data structures are either loaded into memory or read directly from disk, depending on the doc values format. Starting with Lucene 4.5, the default is to load the small data structures that matter for performance (ordinals) into memory and to keep the large data structures (values) on disk. It would be interesting to have a new field data implementation backed by doc values.
Integrating this into Elasticsearch would allow for disk-based field data and for configuring smaller heaps, which would be less subject to garbage collection issues. On the other hand, it will require additional disk space, and since doc values are disk-based by default, they will probably be slower for field-data-intensive workloads.
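The memory/disk split described above (ordinals in memory, values on disk) can be illustrated with a small Python sketch. This is a toy model under stated assumptions: the class `DiskBackedValues`, its length-prefixed file layout, and the `.dv` suffix are all hypothetical and do not reflect any actual Lucene doc values format.

```python
import os
import struct
import tempfile
from typing import List, Sequence


class DiskBackedValues:
    """Toy per-segment store: ordinals on the heap, term bytes in a file."""

    def __init__(self, doc_values: Sequence[str]) -> None:
        terms = sorted(set(doc_values))
        ord_of = {term: i for i, term in enumerate(terms)}
        # Small structure kept in memory: one ordinal per document.
        self.ords: List[int] = [ord_of[value] for value in doc_values]
        # File offsets of each term (also kept in memory in this sketch).
        self.offsets: List[int] = []
        # Large structure written to disk: length-prefixed UTF-8 terms.
        fd, self.path = tempfile.mkstemp(suffix=".dv")
        with os.fdopen(fd, "wb") as out:
            for term in terms:
                data = term.encode("utf-8")
                self.offsets.append(out.tell())
                out.write(struct.pack(">I", len(data)))
                out.write(data)

    def lookup_ord(self, ordinal: int) -> str:
        """Read one term from disk; the OS file-system cache absorbs repeats."""
        with open(self.path, "rb") as handle:
            handle.seek(self.offsets[ordinal])
            (length,) = struct.unpack(">I", handle.read(4))
            return handle.read(length).decode("utf-8")

    def value(self, doc_id: int) -> str:
        """Resolve a document's value via its in-memory ordinal."""
        return self.lookup_ord(self.ords[doc_id])

    def close(self) -> None:
        """Delete the backing file."""
        os.remove(self.path)
```

In this model, comparing two documents within the segment needs only the in-memory ordinals, since ordinals follow the sorted term order; the disk is touched only when the actual value is required. That is one way to picture why heaps can shrink while value-heavy workloads may become slower.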