Disk-based field data #3806

jpountz · 2013-09-30T12:56:06Z

Lucene 4.0 introduced doc values, which are very similar to field data except that they are computed and persisted to disk (per segment) at index time. At search time, these data-structures are either loaded into memory or directly read from disk depending on the doc values format. Starting with Lucene 4.5, the default is to load the small data-structures that matter for performance into memory (ordinals) and to keep the large data-structures on disk (values). It would be interesting to have a new field data implementation that would be backed by doc values.
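
To make the ordinals/values split concrete, here is a minimal toy sketch, not Lucene's actual file format or API. It keeps the small per-document ordinals (plus offsets) in memory and reads the term bytes from a file only when a value is requested. The class `ToyDocValues`, its methods, and the file layout are all hypothetical names invented for illustration.

```python
import os
import struct
import tempfile


class ToyDocValues:
    """Toy single-valued column: ordinals in memory, term bytes on disk.

    Hypothetical illustration only; this is not how Lucene lays out its files.
    """

    def __init__(self, doc_terms, path):
        unique = sorted(set(doc_terms))              # sorted term dictionary
        ord_of = {term: i for i, term in enumerate(unique)}
        self.ordinals = [ord_of[t] for t in doc_terms]  # small, kept in memory
        self.offsets = []                            # file offset of each term
        self.path = path
        with open(path, "wb") as f:
            for term in unique:
                data = term.encode("utf-8")
                self.offsets.append(f.tell())
                f.write(struct.pack(">I", len(data)))  # 4-byte length prefix
                f.write(data)

    def lookup_ord(self, ordinal):
        """Read one term from disk by its ordinal."""
        with open(self.path, "rb") as f:
            f.seek(self.offsets[ordinal])
            (length,) = struct.unpack(">I", f.read(4))
            return f.read(length).decode("utf-8")

    def value(self, doc_id):
        return self.lookup_ord(self.ordinals[doc_id])


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        dv = ToyDocValues(["red", "blue", "red", "green"],
                          os.path.join(tmp, "color.dv"))
        return dv.ordinals, [dv.value(i) for i in range(4)]
```

Because the term dictionary is sorted, comparing two documents by ordinal gives the same order as comparing their terms, so in this toy model sorting never needs to touch the file.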

Integration into Elasticsearch would allow for having disk-based field data and for configuring smaller heaps, which would be less subject to garbage collection issues. On the other hand, this will require additional disk space, and because doc values are disk-based by default, they will probably be slower for field-data-intensive workloads.

This commit allows for using Lucene doc values as a backend for field data, moving the cost of building field data from the refresh operation to indexing. In addition, Lucene doc values can be stored on disk (partially, or even entirely), so that memory management is done at the operating system level (file-system cache) instead of the JVM, avoiding long pauses during major collections due to large heaps.

So far doc values are supported on numeric types and non-analyzed strings (index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values, which is the only type to support multi-valued fields.

Since the field data API set is a bit wider than the doc values API set, some operations are not supported:
- field data filtering: this will fail if doc values are enabled,
- field data cache clearing, even for memory-based doc values formats,
- getting the memory usage for a specific field,
- knowing whether a field is actually multi-valued.

This commit also allows for configuring doc-values formats on a per-field basis similarly to postings formats. In particular the doc values format of the _version field can be configured through its own field mapper (it used to be handled in UidFieldMapper previously).

Closes elastic#3806
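
As a rough illustration of what a SORTED_SET-style column holds for multi-valued fields, the sketch below builds a sorted dictionary of unique terms and, for each document, a sorted and de-duplicated list of ordinals. This is a simplified model under stated assumptions, not Lucene's implementation; the function names `build_sorted_set` and `is_multi_valued` are hypothetical.

```python
def build_sorted_set(docs):
    """Build a toy SORTED_SET-like column.

    docs: one list of string values per document (may be empty).
    Returns (terms, doc_ords): the sorted unique terms, and for each
    document the sorted, de-duplicated ordinals of its values.
    """
    terms = sorted({value for values in docs for value in values})
    ord_of = {term: i for i, term in enumerate(terms)}
    doc_ords = [sorted({ord_of[v] for v in values}) for values in docs]
    return terms, doc_ords


def is_multi_valued(doc_ords):
    """In this toy model the only way to answer is to scan every document."""
    return any(len(ords) > 1 for ords in doc_ords)


terms, doc_ords = build_sorted_set([["b", "a", "b"], [], ["c"]])
```

In this model the original order and duplicates of a document's values are lost, and telling whether a field is actually multi-valued takes a full scan. That is one plausible reading of why the last operation in the list above is not supported, though the commit message does not spell out the reason.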

ghost assigned jpountz Sep 30, 2013

jpountz mentioned this issue Oct 4, 2013

Doc values integration. #3829

Closed

jpountz closed this as completed in 4fa8f6f Oct 9, 2013

johtani mentioned this issue Jan 28, 2014

Change es version to 1.0.0.RC1 codelibs/elasticsearch-solr-api#1

Merged

talevy added a commit to talevy/elasticsearch that referenced this issue Apr 25, 2018

disallow deleting lifecycle policies that are referenced by existing …

0a3dceb

…indices (elastic#3806)

talevy added a commit to talevy/elasticsearch that referenced this issue May 14, 2018

disallow deleting lifecycle policies that are referenced by existing …

5f825bc

…indices (elastic#3806)

jasontedor pushed a commit that referenced this issue Aug 17, 2018

disallow deleting lifecycle policies that are referenced by existing …

950c50f

…indices (#3806)
