Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing #6298

mikemccand · 2014-05-23T15:13:43Z

The TermsEnums used for lookup have highish cost to init, so if we
reuse them we may be able to stop using bloom filters. I ran some bulk
update performance tests, showing that turning off blooms and reusing
the enums gets close to the same performance as master (using blooms
and not reusing the enums).

Closes #6212
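
For readers following along, the reuse idea can be sketched without Lucene. This is only an illustration under stated assumptions: `SegmentCursor` is a hypothetical stand-in for a per-segment TermsEnum (costly to create, cheap to reuse), and the class and method names are invented here, not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: each thread builds its per-segment cursors once and reuses them for every lookup. */
final class ReusedLookupSketch {
    /** Counts cursor constructions so the effect of reuse is observable. */
    static final AtomicInteger CURSORS_CREATED = new AtomicInteger();

    /** Hypothetical stand-in for a TermsEnum over one segment's _uid terms. */
    static final class SegmentCursor {
        private final TreeMap<String, Long> idToVersion;

        SegmentCursor(Map<String, Long> segment) {
            CURSORS_CREATED.incrementAndGet();            // models the init cost
            this.idToVersion = new TreeMap<>(segment);
        }

        /** Returns the version stored for this id in the segment, or null if absent. */
        Long seekExact(String id) {
            return idToVersion.get(id);
        }
    }

    private final ThreadLocal<List<SegmentCursor>> cursors;

    ReusedLookupSketch(List<Map<String, Long>> segments) {
        // Cursors are created lazily, once per thread, on that thread's first lookup.
        this.cursors = ThreadLocal.withInitial(() -> {
            List<SegmentCursor> list = new ArrayList<>();
            for (Map<String, Long> segment : segments) {
                list.add(new SegmentCursor(segment));
            }
            return list;
        });
    }

    /** Probes each segment with the calling thread's cached cursors; returns the first hit. */
    Long lookupVersion(String id) {
        for (SegmentCursor cursor : cursors.get()) {
            Long version = cursor.seekExact(id);
            if (version != null) {
                return version;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> segments = Arrays.asList(
                Collections.singletonMap("a", 3L),
                Collections.singletonMap("b", 5L));
        ReusedLookupSketch lookup = new ReusedLookupSketch(segments);
        for (int i = 0; i < 1000; i++) {
            lookup.lookupVersion("b");
        }
        System.out.println("cursors created: " + CURSORS_CREATED.get());
    }
}
```

In this sketch the cursor count stays at one per segment per thread no matter how many lookups run, which is the cost the description above proposes to amortize.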


nik9000 · 2014-05-23T15:28:47Z

src/main/java/org/elasticsearch/common/lucene/uid/PerThreadIDAndVersionLookup.java

+        List<AtomicReaderContext> leaves = new ArrayList<>(r.leaves());
+
+        // nocommit but ES goes backwards today... is that really best?  backwards is not necessarily reverse time order (TMP merges out of
+        // order)


It looks like backwards in this context means smallest to largest?

Roughly, yes, it goes smallest to largest, but with the default TieredMergePolicy, the segment sizes can vary a lot over time (it doesn't care about / preserve segment order in the index).

I think the idea was also that it is common to have many documents that are pretty static and a few that are frequently updated, and these frequently updated documents would more likely be in the last segments?

OK, I'll switch it to go backwards again.
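
The trade-off discussed in this thread can be illustrated with a small stand-alone sketch. It assumes, hypothetically, that each segment is just a map from ID to version and that list position stands in for leaf order; as noted above, with TieredMergePolicy leaf order is not necessarily time order, so this only models the "frequently updated documents sit in the last segments" case.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Sketch: compare probing segments first-to-last versus last-to-first. */
final class ReverseSegmentSearchSketch {
    /** The version found (or null) and how many segments were probed to find it. */
    static final class Result {
        final Long version;
        final int probes;

        Result(Long version, int probes) {
            this.version = version;
            this.probes = probes;
        }
    }

    /**
     * Looks up an id, visiting segments from the end of the list when lastFirst is true,
     * and stops at the first segment that contains the id.
     */
    static Result lookup(List<Map<String, Long>> segments, String id, boolean lastFirst) {
        int n = segments.size();
        int probes = 0;
        for (int i = 0; i < n; i++) {
            Map<String, Long> segment = segments.get(lastFirst ? n - 1 - i : i);
            probes++;
            Long version = segment.get(id);
            if (version != null) {
                return new Result(version, probes);
            }
        }
        return new Result(null, probes);
    }

    public static void main(String[] args) {
        // Two segments of static documents followed by one holding a recently updated document.
        List<Map<String, Long>> segments = Arrays.asList(
                Collections.singletonMap("static-1", 1L),
                Collections.singletonMap("static-2", 1L),
                Collections.singletonMap("hot", 7L));
        System.out.println("forward probes:  " + lookup(segments, "hot", false).probes);
        System.out.println("backward probes: " + lookup(segments, "hot", true).probes);
    }
}
```

When the updated document lives in the last segment, the backward scan stops after a single probe, while the forward scan touches every segment first. If merges place that document elsewhere, the advantage shrinks accordingly.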

mikemccand · 2014-05-28T15:30:35Z

I ran a bulk indexing perf test comparing master vs this patch, updating 10M small log-entry type docs ("index" command, passing _id), with random UUIDs (%10d, worst case for terms dict), using MMapDir, and the results look promising: reusing the TermsEnum gets back much of the performance that bloom filters buy us today.

However, this is a small test (10M docs), the index was fully hot...

jpountz · 2014-05-30T06:56:03Z

src/main/java/org/elasticsearch/index/codec/postingsformat/PostingFormats.java

@@ -67,7 +67,9 @@
        for (String luceneName : PostingsFormat.availablePostingsFormats()) {
            buildInPostingFormatsX.put(luceneName, new PreBuiltPostingsFormatProvider.Factory(PostingsFormat.forName(luceneName)));
        }
-        final Elasticsearch090PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat();
+        // nocommit can we disable bloom by default


It makes sense to me given the minor speedup they bring. But maybe as a separate change?

+1 for separate issue, I'll open it.

I want to test a larger index, and a cold index first. Separately I'm going to test switching to the new IDVPostingsFormat.

I opened #6349

mikemccand · 2014-05-30T17:19:33Z

OK I went back to searching segments in reverse order, and downgraded the other nocommits to TODOs. I think this is ready; I'll commit soon.

jpountz · 2014-05-30T17:46:10Z

src/main/java/org/elasticsearch/index/codec/postingsformat/PostingFormats.java

@@ -67,7 +67,8 @@
        for (String luceneName : PostingsFormat.availablePostingsFormats()) {
            buildInPostingFormatsX.put(luceneName, new PreBuiltPostingsFormatProvider.Factory(PostingsFormat.forName(luceneName)));
        }
-        final Elasticsearch090PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat();
+        final PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat();
+        //final PostingsFormat defaultFormat = PostingsFormat.forName("Lucene41");


Can you just remove this commented line before pushing?

OK will do!

mikemccand · 2014-05-31T21:44:09Z

Committed with #6212

This commit changes the default for index.codec.bloom.load to false, because bloom filters can use a sizable amount of RAM on indices with many tiny documents, and now give only smallish index-time performance gains for apps that update (not just append) documents, since we've separately improved performance for ID lookups with elastic#6298. Closes elastic#6349

This change just changes the default for index.codec.bloom.load to false: with recent performance improvements to ID lookup, such as #6298, bloom filters don't give much of a performance gain anymore, and they can consume non-trivial RAM when there are many tiny documents. For now, we still index the bloom filters, so if a given app wants them back, it can just update the index.codec.bloom.load to true. Closes #6959

nik9000 reviewed May 23, 2014

jpountz reviewed May 30, 2014

mikemccand mentioned this pull request May 30, 2014

Turn off bloom filters on _uid by default #6349

Closed

remove nocommits; switch back to searching segments in reverse order

256b1d4

jpountz reviewed May 30, 2014

remove commented out line

82de5b7

mikemccand closed this May 31, 2014

mikemccand added v1.3.0 labels Jun 20, 2014

clintongormley changed the title ~~Core: reuse Lucene's TermsEnum for faster _uid/version lookup during indexing~~ Indexing: Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing Jul 16, 2014

clintongormley added the enhancement label Jul 16, 2014

mikemccand mentioned this pull request Jul 22, 2014

Disable loading of bloom filters by default #6959

Closed

clintongormley added the :Core/Infra/Core Core issues without another label label Jun 7, 2015

clintongormley changed the title ~~Indexing: Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing~~ Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing Jun 7, 2015
