Disk-based field data #3806

Closed
jpountz opened this issue Sep 30, 2013 · 0 comments

jpountz commented Sep 30, 2013

Lucene 4.0 introduced doc values, which are very similar to field data except that they are computed at index time and persisted to disk per segment. At search time, these data structures are either loaded into memory or read directly from disk, depending on the doc values format. Starting with Lucene 4.5, the default format loads the small data structures that matter for performance (ordinals) into memory and keeps the large ones (values) on disk. It would be interesting to have a new field data implementation backed by doc values.
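
To make the ordinals/values split concrete, here is a minimal Python sketch of the idea. It is not Lucene's file format or API: the function names (`write_segment`, `read_value`), the length-prefixed layout, and the choice to keep the offset table in memory are all assumptions made for illustration.

```python
import os
import struct
import tempfile


def write_segment(path, doc_values):
    """Write the unique values to disk in sorted order (hypothetical layout).

    Returns (ordinals, offsets), the parts kept in memory:
    ordinals[doc_id] is the rank of that document's value among the sorted
    unique values, and offsets[ordinal] is the byte position of that value.
    """
    unique = sorted(set(doc_values))
    ord_of = {value: i for i, value in enumerate(unique)}
    offsets = []
    with open(path, "wb") as f:
        for value in unique:
            data = value.encode("utf-8")
            offsets.append(f.tell())
            f.write(struct.pack(">I", len(data)))  # 4-byte length prefix
            f.write(data)
    ordinals = [ord_of[v] for v in doc_values]
    return ordinals, offsets


def read_value(path, offsets, ordinal):
    """Resolve an ordinal to its value by seeking into the on-disk file."""
    with open(path, "rb") as f:
        f.seek(offsets[ordinal])
        (length,) = struct.unpack(">I", f.read(4))
        return f.read(length).decode("utf-8")


values = ["red", "blue", "red", "green"]  # one value per document
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "segment.dv")
    ordinals, offsets = write_segment(path, values)
    # Sorting documents by value needs only the in-memory ordinals.
    order = sorted(range(len(values)), key=lambda doc: ordinals[doc])
    # The disk is touched only when the actual value is needed.
    first = read_value(path, offsets, ordinals[order[0]])
```

Because ordinals follow the sort order of the values, comparisons and sorting can run on small integers in memory, while the larger value bytes stay in a file that the operating system caches as it sees fit.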

Integrating this into Elasticsearch would allow disk-based field data and smaller heaps, which would be less subject to garbage collection issues. On the other hand, it would require additional disk space, and since doc values are disk-based by default, they will probably be slower for field-data-intensive workloads.

@ghost assigned jpountz Sep 30, 2013
jpountz added a commit to jpountz/elasticsearch that referenced this issue Oct 4, 2013
This commit allows for using Lucene doc values as a backend for field data,
moving the cost of building field data from the refresh operation to indexing.
In addition, Lucene doc values can be stored on disk (partially, or even
entirely), so that memory management is done at the operating system level
(file-system cache) instead of the JVM, avoiding long pauses during major
collections due to large heaps.

So far doc values are supported on numeric types and non-analyzed strings
(index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values,
which is the only type to support multi-valued fields. Since the field data API
is somewhat wider than the doc values API, some operations are not
supported:
 - field data filtering: this will fail if doc values are enabled,
 - field data cache clearing, even for memory-based doc values formats,
 - getting the memory usage for a specific field,
 - knowing whether a field is actually multi-valued.

This commit also allows for configuring doc values formats on a per-field basis,
similarly to postings formats. In particular, the doc values format of the
_version field can be configured through its own field mapper (it was
previously handled in UidFieldMapper).

Closes elastic#3806
@jpountz jpountz closed this as completed in 4fa8f6f Oct 9, 2013
talevy added a commit to talevy/elasticsearch that referenced this issue Apr 25, 2018
talevy added a commit to talevy/elasticsearch that referenced this issue May 14, 2018