New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing #6298
Conversation
…indexing The TermsEnums used for lookup have highish cost to init, so if we reuse them we may be able to stop using bloom filters. I ran some bulk update performance tests, showing that turning off blooms and reusing the enums gets close to the same performance as master (using blooms and not reusing the enums). Closes elastic#6212
List<AtomicReaderContext> leaves = new ArrayList<>(r.leaves()); | ||
|
||
// nocommit but ES goes backwards today... is that really best? backwards is not necessarily reverse time order (TMP merges out of | ||
// order) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like backwards in this context means smallest to largest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roughly, yes, it goes smallest to largest, but with the default TieredMergePolicy, the segment sizes can vary a lot over time (it doesn't care about / preserve segment order in the index).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea was also that it is frequent to have a lots of documents that are pretty static and a few ones that are frequently updated, and these frequently updated documents would more likely be in the last segments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll switch it to go backwards again.
I ran a bulk indexing perf test comparing master vs this patch, updating 10M small log-entry type docs ("index" command, passing _id), with random UUIDs (%10d, worst case for terms dict), using MMapDir, and the results look promising: reusing the TermsEnum gets back much of the performance that bloom filters buy us today. However, this is a small test (10M docs), the index was fully hot... |
@@ -67,7 +67,9 @@ | |||
for (String luceneName : PostingsFormat.availablePostingsFormats()) { | |||
buildInPostingFormatsX.put(luceneName, new PreBuiltPostingsFormatProvider.Factory(PostingsFormat.forName(luceneName))); | |||
} | |||
final Elasticsearch090PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat(); | |||
// nocommit can we disable bloom by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to me given the minor speedup they bring. But maybe as a separate change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for separate issue, I'll open it.
I want to test a larger index, and a cold index first. Separately I'm going to test switching to the new IDVPostingsFormat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #6349
OK I went back to searching segments in reverse order, and downgraded the other nocommits to TODOs. I think this is ready; I'll commit soon. |
@@ -67,7 +67,8 @@ | |||
for (String luceneName : PostingsFormat.availablePostingsFormats()) { | |||
buildInPostingFormatsX.put(luceneName, new PreBuiltPostingsFormatProvider.Factory(PostingsFormat.forName(luceneName))); | |||
} | |||
final Elasticsearch090PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat(); | |||
final PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat(); | |||
//final PostingsFormat defaultFormat = PostingsFormat.forName("Lucene41"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just remove this commented line before pushing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK will do!
Committed with #6212 |
This commit changes the default for index.codec.bloom.load to false, because bloom filters can use a sizable amount of RAM on indices with many tiny documents, and now only gives smallish index-time performance gains for apps that update (not just append) documents, since we've separately improved performance for ID lookups with elastic#6298. Closes elastic#6349
This change just changes the default for index.codec.bloom.load to false: with recent performance improvements to ID lookup, such as #6298, bloom filters don't give much of a performance gain anymore, and they can consume non-trivial RAM when there are many tiny documents. For now, we still index the bloom filters, so if a given app wants them back, it can just update the index.codec.bloom.load to true. Closes #6959
This change just changes the default for index.codec.bloom.load to false: with recent performance improvements to ID lookup, such as #6298, bloom filters don't give much of a performance gain anymore, and they can consume non-trivial RAM when there are many tiny documents. For now, we still index the bloom filters, so if a given app wants them back, it can just update the index.codec.bloom.load to true. Closes #6959
The TermsEnums used for lookup have highish cost to init, so if we
reuse them we may be able to stop using bloom filters. I ran some bulk
update performance tests, showing that turning off blooms and reusing
the enums gets close to the same performance as master (using blooms
and not reusing the enums).
Closes #6212