Doc values integration.

This commit allows for using Lucene doc values as a backend for field data, moving the cost of building field data from the refresh operation to indexing. In addition, Lucene doc values can be stored on disk (partially, or even entirely), so that memory management is done at the operating system level (file-system cache) instead of the JVM, avoiding long pauses during major collections due to large heaps.

So far doc values are supported on numeric types and non-analyzed strings (index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values which is the only type to support multi-valued fields.

Since the field data API set is a bit wider than the doc values API set, some operations are not supported:
 - field data filtering: this will fail if doc values are enabled,
 - field data cache clearing, even for memory-based doc values formats,
 - getting the memory usage for a specific field,
 - knowing whether a field is actually multi-valued.

This commit also allows for configuring doc-values formats on a per-field basis similarly to postings formats. In particular the doc values format of the _version field can be configured through its own field mapper (it used to be handled in UidFieldMapper previously).

Closes #3806
elastic · Oct 9, 2013 · 4fa8f6f
1 parent 4b96b2c
Showing 107 changed files with 3,978 additions and 1,453 deletions.
diff --git a/docs/reference/index-modules/codec.asciidoc b/docs/reference/index-modules/codec.asciidoc
@@ -4,10 +4,14 @@
 Codecs define how documents are written to disk and read from disk. The
 postings format is the part of the codec that is responsible for reading
 and writing the term dictionary, postings lists and positions, payloads
-and offsets stored in the postings list.
-
-Configuring custom postings formats is an expert feature and most likely
-using the builtin postings formats will suite your needs as is described
+and offsets stored in the postings list. The doc values format is
+responsible for reading column-stride storage for a field and is typically
+used for sorting or faceting. When a field doesn't have doc values enabled,
+it is still possible to sort or facet by loading field values from the
+inverted index into main memory.
+
+Configuring custom postings or doc values formats is an expert feature and
+most likely using the builtin formats will suit your needs as is described
 in the <<mapping-core-types,mapping section>>
 
 [float]
@@ -170,3 +174,74 @@ The default postings format has the following options:
     dictionary uses to encode on-disk blocks. Defaults to *48*.
 
 Type name: `default`
+
+[float]
+=== Configuring a custom doc values format
+
+Custom doc values formats can be defined in the index settings in the
+`codec` part. The `codec` part can be configured when creating an index
+or updating index settings. An example on how to define your custom
+doc values format:
+
+[source,js]
+--------------------------------------------------
+curl -XPUT 'http://localhost:9200/twitter/' -d '{
+    "settings" : {
+        "index" : {
+            "codec" : {
+                "doc_values_format" : {
+                    "my_format" : {
+                        "type" : "disk"
+                    }
+                }
+            }
+        }
+    }
+}'
+--------------------------------------------------
+
+Then, when defining your mapping, you can use the `my_format` name in the
+`doc_values_format` option as the example below illustrates:
+
+[source,js]
+--------------------------------------------------
+{
+  "product" : {
+     "properties" : {
+         "price" : {"type" : "integer", "doc_values_format" : "my_format"}
+     }
+  }
+}
+--------------------------------------------------
+
+[float]
+=== Available doc values formats
+
+[float]
+==== Memory doc values format
+
+A doc values format that stores all values in a FST in RAM. This format does
+write to disk but the whole data-structure is loaded into memory when reading
+the index. The memory doc values format has no options.
+
+Type name: `memory`
+
+[float]
+==== Disk doc values format
+
+A doc values format that stores and reads everything from disk. Although it may
+be slightly slower than the default doc values format, this doc values format
+will require almost no memory from the JVM. The disk doc values format has no
+options.
+
+Type name: `disk`
+
+[float]
+==== Default doc values format
+
+The default doc values format tries to make a good compromise between speed and
+memory usage by only loading into memory data-structures that matter for
+performance. This makes this doc values format a good fit for most use-cases.
+The default doc values format has no options.
+
+Type name: `default`
diff --git a/docs/reference/index-modules/fielddata.asciidoc b/docs/reference/index-modules/fielddata.asciidoc
@@ -24,6 +24,59 @@ field data after a certain time of inactivity. Defaults to `-1`. For
 example, can be set to `5m` for a 5 minute expiry.
 |=======================================================================
 
+=== Field data formats
+
+Depending on the field type, there might be several field data types
+available. In particular, string and numeric types support the `doc_values`
+format which allows for computing the field data data-structures at indexing
+time and storing them on disk. Although it will make the index larger and may
+be slightly slower, this implementation will be more near-realtime-friendly
+and will require much less memory from the JVM than other implementations.
+
+[source,js]
+--------------------------------------------------
+{
+    tag: {
+        type:      "string",
+        fielddata: {
+            format: "fst"
+        }
+    }
+}
+--------------------------------------------------
+
+[float]
+==== String field data types
+
+`paged_bytes` (default)::
+    Stores unique terms sequentially in a large buffer and maps documents to
+    the indices of the terms they contain in this large buffer.
+
+`fst`::
+    Stores terms in a FST. Slower to build than `paged_bytes` but can help lower
+    memory usage if many terms share common prefixes and/or suffixes.
+
+`doc_values`::
+    Computes and stores field data data-structures on disk at indexing time.
+    Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
+    `not_analyzed`) and doesn't support filtering.
+
+[float]
+==== Numeric field data types
+
+`array` (default)::
+    Stores field values in memory using arrays. 
+
+`doc_values`::
+    Computes and stores field data data-structures on disk at indexing time.
+    Doesn't support filtering.
+
+[float]
+==== Geo point field data types
+
+`array` (default)::
+    Stores latitudes and longitudes in arrays.
+
 [float]
 === Fielddata loading
 

diff --git a/docs/reference/mapping/types/core-types.asciidoc b/docs/reference/mapping/types/core-types.asciidoc
@@ -462,7 +462,53 @@ custom postings format. See
 information.
 
 [float]
-[[similarity]]
+==== Doc values format
+
+Doc values formats define how fields are written into column-stride storage in
+the index for the purpose of sorting or faceting. Fields that have doc values
+enabled will have special field data instances, which will not be uninverted
+from the inverted index, but directly read from disk. This makes _refresh faster
+and ultimately allows for having field data stored on disk depending on the
+configured doc values format.
+
+Doc values formats are configurable. Elasticsearch has several builtin formats:
+
+`memory`::
+        A doc values format which stores data in memory. Compared to the default
+        field data implementations, using doc values with this format will have
+        similar performance but will be faster to load, making '_refresh' less
+        time-consuming.
+
+`disk`::
+        A doc values format which stores all data on disk, requiring almost no
+        memory from the JVM at the cost of a slight performance degradation.
+
+`default`::
+        The default Elasticsearch doc values format, offering good performance
+        with low memory usage. This format is used if no format is specified in
+        the field mapping.
+
+[float]
+===== Doc values format example
+
+On all field types, it is possible to configure a `doc_values_format` attribute:
+
+[source,js]
+--------------------------------------------------
+{
+  "product" : {
+     "properties" : {
+         "price" : {"type" : "integer", "doc_values_format" : "memory"}
+     }
+  }
+}
+--------------------------------------------------
+
+On top of using the built-in doc values formats it is possible to define
+custom doc values formats. See
+<<index-modules-codec,codec module>> for more information.
+
+[float]
 ==== Similarity
 
 Elasticsearch allows you to configure a similarity (scoring algorithm) per field.

diff --git a/src/main/java/org/elasticsearch/action/mlt/TransportMoreLikeThisAction.java b/src/main/java/org/elasticsearch/action/mlt/TransportMoreLikeThisAction.java
@@ -274,6 +274,9 @@ private void parseSource(GetResponse getResponse, final BoolQueryBuilder boolBui
         docMapper.parse(SourceToParse.source(getResponse.getSourceAsBytesRef()).type(request.type()).id(request.id()), new DocumentMapper.ParseListenerAdapter() {
             @Override
             public boolean beforeFieldAdded(FieldMapper fieldMapper, Field field, Object parseContext) {
+                if (!field.fieldType().indexed()) {
+                    return false;
+                }
                 if (fieldMapper instanceof InternalMapper) {
                     return true;
                 }

diff --git a/src/main/java/org/elasticsearch/common/lucene/uid/Versions.java b/src/main/java/org/elasticsearch/common/lucene/uid/Versions.java
@@ -24,6 +24,7 @@
 import org.apache.lucene.util.BytesRef;
 import org.elasticsearch.common.Numbers;
 import org.elasticsearch.index.mapper.internal.UidFieldMapper;
+import org.elasticsearch.index.mapper.internal.VersionFieldMapper;
 
 import java.io.IOException;
 import java.util.List;
@@ -95,7 +96,7 @@ public static DocIdAndVersion loadDocIdAndVersion(AtomicReaderContext readerCont
         }
 
         // Versions are stored as doc values...
-        final NumericDocValues versions = reader.getNumericDocValues(UidFieldMapper.VERSION);
+        final NumericDocValues versions = reader.getNumericDocValues(VersionFieldMapper.NAME);
         if (versions != null || !terms.hasPayloads()) {
             // only the last doc that matches the _uid is interesting here: if it is deleted, then there is
             // no match otherwise previous docs are necessarily either deleted or nested docs

diff --git a/src/main/java/org/elasticsearch/index/codec/CodecModule.java b/src/main/java/org/elasticsearch/index/codec/CodecModule.java
@@ -27,6 +27,10 @@
 import org.elasticsearch.common.inject.multibindings.MapBinder;
 import org.elasticsearch.common.settings.NoClassSettingsException;
 import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.codec.docvaluesformat.DocValuesFormatProvider;
+import org.elasticsearch.index.codec.docvaluesformat.DocValuesFormatService;
+import org.elasticsearch.index.codec.docvaluesformat.DocValuesFormats;
+import org.elasticsearch.index.codec.docvaluesformat.PreBuiltDocValuesFormatProvider;
 import org.elasticsearch.index.codec.postingsformat.PostingFormats;
 import org.elasticsearch.index.codec.postingsformat.PostingsFormatProvider;
 import org.elasticsearch.index.codec.postingsformat.PostingsFormatService;
@@ -35,9 +39,9 @@
 import java.util.Map;
 
 /**
- * The {@link CodecModule} creates and loads the {@link CodecService} and
- * {@link PostingsFormatService} allowing low level data-structure
- * specialization on a Lucene Segment basis.
+ * The {@link CodecModule} creates and loads the {@link CodecService},
+ * {@link PostingsFormatService} and {@link DocValuesFormatService},
+ * allowing low level data-structure specialization on a Lucene Segment basis.
  * <p>
  * The codec module is the authoritative source for built-in and custom
  * {@link PostingsFormatProvider}. During module bootstrap it processes the
@@ -72,21 +76,25 @@ public class CodecModule extends AbstractModule {
 
     private final Settings indexSettings;
 
-    private Map<String, Class<? extends PostingsFormatProvider>> customProviders = Maps.newHashMap();
+    private final Map<String, Class<? extends PostingsFormatProvider>> customPostingsFormatProviders = Maps.newHashMap();
+    private final Map<String, Class<? extends DocValuesFormatProvider>> customDocValuesFormatProviders = Maps.newHashMap();
 
     public CodecModule(Settings indexSettings) {
         this.indexSettings = indexSettings;
     }
 
     public CodecModule addPostingFormat(String name, Class<? extends PostingsFormatProvider> provider) {
-        this.customProviders.put(name, provider);
+        this.customPostingsFormatProviders.put(name, provider);
         return this;
     }
 
-    @Override
-    protected void configure() {
+    public CodecModule addDocValuesFormat(String name, Class<? extends DocValuesFormatProvider> provider) {
+        this.customDocValuesFormatProviders.put(name, provider);
+        return this;
+    }
 
-        Map<String, Class<? extends PostingsFormatProvider>> postingFormatProviders = Maps.newHashMap(customProviders);
+    private void configurePostingsFormats() {
+        Map<String, Class<? extends PostingsFormatProvider>> postingFormatProviders = Maps.newHashMap(customPostingsFormatProviders);
 
         Map<String, Settings> postingsFormatsSettings = indexSettings.getGroups(PostingsFormatProvider.POSTINGS_FORMAT_SETTINGS_PREFIX);
         for (Map.Entry<String, Settings> entry : postingsFormatsSettings.entrySet()) {
@@ -123,6 +131,53 @@ protected void configure() {
         }
 
         bind(PostingsFormatService.class).asEagerSingleton();
+    }
+
+    private void configureDocValuesFormats() {
+        Map<String, Class<? extends DocValuesFormatProvider>> docValuesFormatProviders = Maps.newHashMap(customDocValuesFormatProviders);
+
+        Map<String, Settings> docValuesFormatSettings = indexSettings.getGroups(DocValuesFormatProvider.DOC_VALUES_FORMAT_SETTINGS_PREFIX);
+        for (Map.Entry<String, Settings> entry : docValuesFormatSettings.entrySet()) {
+            final String name = entry.getKey();
+            final Settings settings = entry.getValue();
+
+            final String sType = settings.get("type");
+            if (sType == null || sType.trim().isEmpty()) {
+                throw new ElasticSearchIllegalArgumentException("DocValuesFormat Factory [" + name + "] must have a type associated with it");
+            }
+
+            final Class<? extends DocValuesFormatProvider> type;
+            try {
+                type = settings.getAsClass("type", null, "org.elasticsearch.index.codec.docvaluesformat.", "DocValuesFormatProvider");
+            } catch (NoClassSettingsException e) {
+                throw new ElasticSearchIllegalArgumentException("The specified type [" + sType + "] for docValuesFormat Factory [" + name + "] can't be found");
+            }
+            docValuesFormatProviders.put(name, type);
+        }
+
+        // now bind
+        MapBinder<String, DocValuesFormatProvider.Factory> docValuesFormatFactoryBinder
+                = MapBinder.newMapBinder(binder(), String.class, DocValuesFormatProvider.Factory.class);
+
+        for (Map.Entry<String, Class<? extends DocValuesFormatProvider>> entry : docValuesFormatProviders.entrySet()) {
+            docValuesFormatFactoryBinder.addBinding(entry.getKey()).toProvider(FactoryProvider.newFactory(DocValuesFormatProvider.Factory.class, entry.getValue())).in(Scopes.SINGLETON);
+        }
+
+        for (PreBuiltDocValuesFormatProvider.Factory factory : DocValuesFormats.listFactories()) {
+            if (docValuesFormatProviders.containsKey(factory.name())) {
+                continue;
+            }
+            docValuesFormatFactoryBinder.addBinding(factory.name()).toInstance(factory);
+        }
+
+        bind(DocValuesFormatService.class).asEagerSingleton();
+    }
+
+    @Override
+    protected void configure() {
+        configurePostingsFormats();
+        configureDocValuesFormats();
+
         bind(CodecService.class).asEagerSingleton();
     }
 }
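
The name resolution performed by `configureDocValuesFormats()` above can be sketched without Guice. This is a simplified illustration, not the commit's code: plain strings stand in for provider classes and factories, the class name `DocValuesFormatResolutionSketch` and the `custom:`/`prebuilt:` labels are hypothetical, and the list of built-in names (`default`, `memory`, `disk`) is assumed from the documentation in this commit rather than from `DocValuesFormats.listFactories()`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch: strings stand in for provider classes and factories.
class DocValuesFormatResolutionSketch {

    /**
     * Follows the order of operations in configureDocValuesFormats():
     * start from providers added in code, overlay the ones declared in the
     * index settings, then bind each built-in only if its name is still free.
     */
    static Map<String, String> resolve(Map<String, String> addedInCode,
                                       Map<String, String> typeBySettingsName,
                                       List<String> builtInNames) {
        Map<String, String> bindings = new LinkedHashMap<>(addedInCode);

        for (Map.Entry<String, String> entry : typeBySettingsName.entrySet()) {
            String type = entry.getValue();
            // Same check as the diff: every configured format needs a type.
            if (type == null || type.trim().isEmpty()) {
                throw new IllegalArgumentException("DocValuesFormat Factory ["
                        + entry.getKey() + "] must have a type associated with it");
            }
            bindings.put(entry.getKey(), "custom:" + type);
        }

        // Built-ins never replace a custom format registered under the same name.
        for (String name : builtInNames) {
            if (!bindings.containsKey(name)) {
                bindings.put(name, "prebuilt:" + name);
            }
        }
        return bindings;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("my_format", "disk");
        settings.put("default", "memory"); // shadows the built-in named "default"

        Map<String, String> bindings = resolve(
                new LinkedHashMap<String, String>(),
                settings,
                Arrays.asList("default", "memory", "disk")); // assumed built-in names
        System.out.println(bindings);
    }
}
```

The point of the sketch is the precedence: a format declared under `index.codec.doc_values_format` with the same name as a built-in takes the place of that built-in, because the pre-built factories are only bound for names that are not already taken.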