Doc values integration.
This commit allows for using Lucene doc values as a backend for field data,
moving the cost of building field data from the refresh operation to indexing.
In addition, Lucene doc values can be stored on disk (partially, or even
entirely), so that memory management is done at the operating system level
(file-system cache) instead of the JVM, avoiding long pauses during major
collections due to large heaps.

So far doc values are supported on numeric types and non-analyzed strings
(index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values
which is the only type to support multi-valued fields. Since the field data API
set is a bit wider than the doc values API set, some operations are not
supported:
 - field data filtering: this will fail if doc values are enabled,
 - field data cache clearing, even for memory-based doc values formats,
 - getting the memory usage for a specific field,
 - knowing whether a field is actually multi-valued.

This commit also allows for configuring doc-values formats on a per-field basis
similarly to postings formats. In particular the doc values format of the
_version field can be configured through its own field mapper (it used to be
handled in UidFieldMapper previously).

Closes #3806
jpountz committed Oct 9, 2013
1 parent 4b96b2c commit 4fa8f6f
Showing 107 changed files with 3,978 additions and 1,453 deletions.
83 changes: 79 additions & 4 deletions docs/reference/index-modules/codec.asciidoc
@@ -4,10 +4,14 @@
Codecs define how documents are written to disk and read from disk. The
postings format is the part of the codec that is responsible for reading
and writing the term dictionary, postings lists and positions, payloads
and offsets stored in the postings list.

Configuring custom postings formats is an expert feature and most likely
using the builtin postings formats will suite your needs as is described
and offsets stored in the postings list. The doc values format is
responsible for reading column-stride storage for a field and is typically
used for sorting or faceting. When a field doesn't have doc values enabled,
it is still possible to sort or facet by loading field values from the
inverted index into main memory.

Configuring custom postings or doc values formats is an expert feature and
most likely using the builtin formats will suit your needs as is described
in the <<mapping-core-types,mapping section>>

[float]
@@ -170,3 +174,74 @@ The default postings format has the following options:
dictionary uses to encode on-disk blocks. Defaults to *48*.

Type name: `default`

[float]
=== Configuring a custom doc values format

A custom doc values format can be defined in the index settings in the
`codec` part. The `codec` part can be configured when creating an index
or updating index settings. An example of how to define your custom
doc values format:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
"settings" : {
"index" : {
"codec" : {
"doc_values_format" : {
"my_format" : {
"type" : "disk"
}
}
}
}
}
}'
--------------------------------------------------

When defining your mapping, you can use the `my_format` name in the
`doc_values_format` option as the example below illustrates:

[source,js]
--------------------------------------------------
{
"product" : {
"properties" : {
"price" : {"type" : "integer", "doc_values_format" : "my_format"}
}
}
}
--------------------------------------------------

[float]
=== Available doc values formats

[float]
==== Memory doc values format

A doc values format that stores all values in an FST in RAM. This format does
write to disk but the whole data-structure is loaded into memory when reading
the index. The memory doc values format has no options.

Type name: `memory`

[float]
==== Disk doc values format

A doc values format that stores and reads everything from disk. Although it may
be slightly slower than the default doc values format, this doc values format
will require almost no memory from the JVM. The disk doc values format has no
options.

Type name: `disk`

[float]
==== Default doc values format

The default doc values format tries to make a good compromise between speed and
memory usage by only loading into memory data-structures that matter for
performance. This makes this doc values format a good fit for most use-cases.
The default doc values format has no options.

Type name: `default`
53 changes: 53 additions & 0 deletions docs/reference/index-modules/fielddata.asciidoc
@@ -24,6 +24,59 @@ field data after a certain time of inactivity. Defaults to `-1`. For
example, can be set to `5m` for a 5 minute expiry.
|=======================================================================

=== Field data formats

Depending on the field type, there might be several field data types
available. In particular, string and numeric types support the `doc_values`
format which allows for computing the field data data-structures at indexing
time and storing them on disk. Although it will make the index larger and may
be slightly slower, this implementation will be more near-realtime-friendly
and will require much less memory from the JVM than other implementations.

[source,js]
--------------------------------------------------
{
tag: {
type: "string",
fielddata: {
format: "fst"
}
}
}
--------------------------------------------------

[float]
==== String field data types

`paged_bytes` (default)::
Stores unique terms sequentially in a large buffer and maps documents to
the indices of the terms they contain in this large buffer.

`fst`::
Stores terms in a FST. Slower to build than `paged_bytes` but can help lower
memory usage if many terms share common prefixes and/or suffixes.

`doc_values`::
Computes and stores field data data-structures on disk at indexing time.
Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
`not_analyzed`) and doesn't support filtering.
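
The `paged_bytes` description above (unique terms stored back to back in one buffer, documents mapped to positions in it) can be pictured with a small Java sketch. This only illustrates the idea and is not the Elasticsearch implementation: the class `TermBufferSketch` and its fields are hypothetical names, and the sketch assumes single-valued fields for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of the idea only; not the Elasticsearch implementation.
class TermBufferSketch {
    final byte[] buffer;   // every unique term, concatenated in sorted order
    final int[] offsets;   // term `ord` occupies buffer[offsets[ord] .. offsets[ord + 1])
    final int[] docToOrd;  // ordinal of the term held by each document (single-valued)

    TermBufferSketch(String[] valuePerDoc) {
        TreeSet<String> uniqueTerms = new TreeSet<>(Arrays.asList(valuePerDoc));
        Map<String, Integer> ordOf = new HashMap<>();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new int[uniqueTerms.size() + 1];
        int ord = 0;
        for (String term : uniqueTerms) {
            offsets[ord] = out.size();              // where this term starts
            byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);      // append the term to the buffer
            ordOf.put(term, ord++);
        }
        offsets[ord] = out.size();                  // end marker for the last term
        buffer = out.toByteArray();

        // Each document stores a small integer instead of a copy of the term.
        docToOrd = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            docToOrd[doc] = ordOf.get(valuePerDoc[doc]);
        }
    }

    // Resolve a document's value by slicing the shared buffer.
    String value(int doc) {
        int ord = docToOrd[doc];
        int start = offsets[ord];
        return new String(buffer, start, offsets[ord + 1] - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        TermBufferSketch fieldData =
                new TermBufferSketch(new String[] {"red", "blue", "red", "green"});
        System.out.println(Arrays.toString(fieldData.docToOrd)); // [2, 0, 2, 1]
        System.out.println(fieldData.value(3));                  // green
    }
}
```

Because repeated values share one copy of the term, memory use depends mostly on the number of unique terms rather than on the number of documents.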

[float]
==== Numeric field data types

`array` (default)::
Stores field values in memory using arrays.

`doc_values`::
Computes and stores field data data-structures on disk at indexing time.
Doesn't support filtering.

[float]
==== Geo point field data types

`array` (default)::
Stores latitudes and longitudes in arrays.

[float]
=== Fielddata loading

48 changes: 47 additions & 1 deletion docs/reference/mapping/types/core-types.asciidoc
@@ -462,7 +462,53 @@ custom postings format. See
information.

[float]
[[similarity]]
==== Doc values format

Doc values formats define how fields are written into column-stride storage in
the index for the purpose of sorting or faceting. Fields that have doc values
enabled will have special field data instances, which will not be uninverted
from the inverted index, but directly read from disk. This makes _refresh faster
and ultimately allows for having field data stored on disk depending on the
configured doc values format.
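
To make "uninverted from the inverted index" more concrete, here is a minimal Java sketch of the work that regular field data has to do at load time and that doc values move to indexing time. It is only an illustration under simplifying assumptions: `ColumnStrideSketch` and `uninvert` are hypothetical names, and a plain map stands in for the inverted index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names come from Lucene or Elasticsearch.
class ColumnStrideSketch {

    // Uninverting: walk every term's postings list and scatter the term into
    // a per-document column. The cost grows with the total number of postings.
    static List<List<String>> uninvert(Map<String, int[]> invertedIndex, int maxDoc) {
        List<List<String>> column = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            column.add(new ArrayList<>());
        }
        for (Map.Entry<String, int[]> postings : invertedIndex.entrySet()) {
            for (int doc : postings.getValue()) {
                column.get(doc).add(postings.getKey());
            }
        }
        return column;
    }

    public static void main(String[] args) {
        // term -> IDs of the documents that contain it, iterated in term order
        Map<String, int[]> invertedIndex = new TreeMap<>();
        invertedIndex.put("blue", new int[] {1});
        invertedIndex.put("green", new int[] {0, 2});
        invertedIndex.put("red", new int[] {0});

        // The resulting doc -> values column is what sorting and faceting need.
        System.out.println(uninvert(invertedIndex, 3)); // [[green, red], [blue], [green]]
    }
}
```

With doc values enabled, a column of this shape is written alongside the segment when documents are indexed, so nothing has to be rebuilt from the postings when the field is first used.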

Doc values formats are configurable. Elasticsearch has several builtin formats:

`memory`::
A doc values format which stores data in memory. Compared to the default
field data implementations, using doc values with this format will have
similar performance but will be faster to load, making '_refresh' less
time-consuming.

`disk`::
A doc values format which stores all data on disk, requiring almost no
memory from the JVM at the cost of a slight performance degradation.

`default`::
The default Elasticsearch doc values format, offering good performance
with low memory usage. This format is used if no format is specified in
the field mapping.

[float]
===== Doc values format example

On all field types, it is possible to configure a `doc_values_format` attribute:

[source,js]
--------------------------------------------------
{
"product" : {
"properties" : {
"price" : {"type" : "integer", "doc_values_format" : "memory"}
}
}
}
--------------------------------------------------

On top of using the built-in doc values formats it is possible to define
custom doc values formats. See
<<index-modules-codec,codec module>> for more information.

[float]
==== Similarity

Elasticsearch allows you to configure a similarity (scoring algorithm) per field.
@@ -274,6 +274,9 @@ private void parseSource(GetResponse getResponse, final BoolQueryBuilder boolBui
docMapper.parse(SourceToParse.source(getResponse.getSourceAsBytesRef()).type(request.type()).id(request.id()), new DocumentMapper.ParseListenerAdapter() {
@Override
public boolean beforeFieldAdded(FieldMapper fieldMapper, Field field, Object parseContext) {
if (!field.fieldType().indexed()) {
return false;
}
if (fieldMapper instanceof InternalMapper) {
return true;
}
@@ -24,6 +24,7 @@
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.Numbers;
import org.elasticsearch.index.mapper.internal.UidFieldMapper;
import org.elasticsearch.index.mapper.internal.VersionFieldMapper;

import java.io.IOException;
import java.util.List;
@@ -95,7 +96,7 @@ public static DocIdAndVersion loadDocIdAndVersion(AtomicReaderContext readerCont
}

// Versions are stored as doc values...
final NumericDocValues versions = reader.getNumericDocValues(UidFieldMapper.VERSION);
final NumericDocValues versions = reader.getNumericDocValues(VersionFieldMapper.NAME);
if (versions != null || !terms.hasPayloads()) {
// only the last doc that matches the _uid is interesting here: if it is deleted, then there is
// no match otherwise previous docs are necessarily either deleted or nested docs
71 changes: 63 additions & 8 deletions src/main/java/org/elasticsearch/index/codec/CodecModule.java
@@ -27,6 +27,10 @@
import org.elasticsearch.common.inject.multibindings.MapBinder;
import org.elasticsearch.common.settings.NoClassSettingsException;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.codec.docvaluesformat.DocValuesFormatProvider;
import org.elasticsearch.index.codec.docvaluesformat.DocValuesFormatService;
import org.elasticsearch.index.codec.docvaluesformat.DocValuesFormats;
import org.elasticsearch.index.codec.docvaluesformat.PreBuiltDocValuesFormatProvider;
import org.elasticsearch.index.codec.postingsformat.PostingFormats;
import org.elasticsearch.index.codec.postingsformat.PostingsFormatProvider;
import org.elasticsearch.index.codec.postingsformat.PostingsFormatService;
@@ -35,9 +39,9 @@
import java.util.Map;

/**
* The {@link CodecModule} creates and loads the {@link CodecService} and
* {@link PostingsFormatService} allowing low level data-structure
* specialization on a Lucene Segment basis.
* The {@link CodecModule} creates and loads the {@link CodecService},
* {@link PostingsFormatService} and {@link DocValuesFormatService},
* allowing low level data-structure specialization on a Lucene Segment basis.
* <p>
* The codec module is the authoritative source for built-in and custom
* {@link PostingsFormatProvider}. During module bootstrap it processes the
@@ -72,21 +76,25 @@ public class CodecModule extends AbstractModule {

private final Settings indexSettings;

private Map<String, Class<? extends PostingsFormatProvider>> customProviders = Maps.newHashMap();
private final Map<String, Class<? extends PostingsFormatProvider>> customPostingsFormatProviders = Maps.newHashMap();
private final Map<String, Class<? extends DocValuesFormatProvider>> customDocValuesFormatProviders = Maps.newHashMap();

public CodecModule(Settings indexSettings) {
this.indexSettings = indexSettings;
}

public CodecModule addPostingFormat(String name, Class<? extends PostingsFormatProvider> provider) {
this.customProviders.put(name, provider);
this.customPostingsFormatProviders.put(name, provider);
return this;
}

@Override
protected void configure() {
public CodecModule addDocValuesFormat(String name, Class<? extends DocValuesFormatProvider> provider) {
this.customDocValuesFormatProviders.put(name, provider);
return this;
}

Map<String, Class<? extends PostingsFormatProvider>> postingFormatProviders = Maps.newHashMap(customProviders);
private void configurePostingsFormats() {
Map<String, Class<? extends PostingsFormatProvider>> postingFormatProviders = Maps.newHashMap(customPostingsFormatProviders);

Map<String, Settings> postingsFormatsSettings = indexSettings.getGroups(PostingsFormatProvider.POSTINGS_FORMAT_SETTINGS_PREFIX);
for (Map.Entry<String, Settings> entry : postingsFormatsSettings.entrySet()) {
@@ -123,6 +131,53 @@ protected void configure() {
}

bind(PostingsFormatService.class).asEagerSingleton();
}

private void configureDocValuesFormats() {
Map<String, Class<? extends DocValuesFormatProvider>> docValuesFormatProviders = Maps.newHashMap(customDocValuesFormatProviders);

Map<String, Settings> docValuesFormatSettings = indexSettings.getGroups(DocValuesFormatProvider.DOC_VALUES_FORMAT_SETTINGS_PREFIX);
for (Map.Entry<String, Settings> entry : docValuesFormatSettings.entrySet()) {
final String name = entry.getKey();
final Settings settings = entry.getValue();

final String sType = settings.get("type");
if (sType == null || sType.trim().isEmpty()) {
throw new ElasticSearchIllegalArgumentException("DocValuesFormat Factory [" + name + "] must have a type associated with it");
}

final Class<? extends DocValuesFormatProvider> type;
try {
type = settings.getAsClass("type", null, "org.elasticsearch.index.codec.docvaluesformat.", "DocValuesFormatProvider");
} catch (NoClassSettingsException e) {
throw new ElasticSearchIllegalArgumentException("The specified type [" + sType + "] for docValuesFormat Factory [" + name + "] can't be found");
}
docValuesFormatProviders.put(name, type);
}

// now bind
MapBinder<String, DocValuesFormatProvider.Factory> docValuesFormatFactoryBinder
= MapBinder.newMapBinder(binder(), String.class, DocValuesFormatProvider.Factory.class);

for (Map.Entry<String, Class<? extends DocValuesFormatProvider>> entry : docValuesFormatProviders.entrySet()) {
docValuesFormatFactoryBinder.addBinding(entry.getKey()).toProvider(FactoryProvider.newFactory(DocValuesFormatProvider.Factory.class, entry.getValue())).in(Scopes.SINGLETON);
}

for (PreBuiltDocValuesFormatProvider.Factory factory : DocValuesFormats.listFactories()) {
if (docValuesFormatProviders.containsKey(factory.name())) {
continue;
}
docValuesFormatFactoryBinder.addBinding(factory.name()).toInstance(factory);
}

bind(DocValuesFormatService.class).asEagerSingleton();
}

@Override
protected void configure() {
configurePostingsFormats();
configureDocValuesFormats();

bind(CodecService.class).asEagerSingleton();
}
}
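
The name-resolution order in `configureDocValuesFormats()` above can be summarized with a dependency-free Java sketch: providers registered in code come first, formats declared in the index settings override them, and pre-built formats only fill names that are still free. This is a simplified stand-in rather than the actual module. `FormatRegistrySketch` is a hypothetical name, provider classes and Guice bindings are replaced by plain strings, and the pre-built names `default`, `memory` and `disk` are taken from the documentation in this commit rather than from `DocValuesFormats.listFactories()`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: provider classes and Guice bindings become plain strings.
class FormatRegistrySketch {

    static Map<String, String> resolve(Map<String, String> registeredInCode,
                                       Map<String, String> typeBySettingsName,
                                       List<String> preBuiltNames) {
        // 1. Start from the providers that were registered programmatically.
        Map<String, String> resolved = new LinkedHashMap<>(registeredInCode);

        // 2. Formats declared in the index settings are added next and
        //    replace any programmatic entry with the same name.
        for (Map.Entry<String, String> entry : typeBySettingsName.entrySet()) {
            String type = entry.getValue();
            if (type == null || type.trim().isEmpty()) {
                throw new IllegalArgumentException(
                        "format [" + entry.getKey() + "] must have a type associated with it");
            }
            resolved.put(entry.getKey(), type);
        }

        // 3. Pre-built formats are bound only under names nobody has claimed.
        for (String name : preBuiltNames) {
            if (!resolved.containsKey(name)) {
                resolved.put(name, "prebuilt:" + name);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> fromSettings = new LinkedHashMap<>();
        fromSettings.put("my_format", "disk");
        fromSettings.put("default", "memory"); // shadows the pre-built "default"

        Map<String, String> resolved = resolve(
                new LinkedHashMap<>(), fromSettings, Arrays.asList("default", "memory", "disk"));
        System.out.println(resolved);
        // {my_format=disk, default=memory, memory=prebuilt:memory, disk=prebuilt:disk}
    }
}
```

One consequence of this ordering is that an index can redefine a built-in name such as `default`, because the pre-built factory is skipped when the name is already taken.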
