Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing #6298

Closed
wants to merge 3 commits into from

Conversation

mikemccand
Copy link
Contributor

The TermsEnums used for lookup have highish cost to init, so if we
reuse them we may be able to stop using bloom filters. I ran some bulk
update performance tests, showing that turning off blooms and reusing
the enums gets close to the same performance as master (using blooms
and not reusing the enums).

Closes #6212

…indexing

The TermsEnums used for lookup have highish cost to init, so if we
reuse them we may be able to stop using bloom filters.  I ran some bulk
update performance tests, showing that turning off blooms and reusing
the enums gets close to the same performance as master (using blooms
and not reusing the enums).

Closes elastic#6212
List<AtomicReaderContext> leaves = new ArrayList<>(r.leaves());

// nocommit but ES goes backwards today... is that really best? backwards is not necessarily reverse time order (TMP merges out of
// order)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like backwards in this context means smallest to largest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roughly, yes, it goes smallest to largest, but with the default TieredMergePolicy, the segment sizes can vary a lot over time (it doesn't care about / preserve segment order in the index).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea was also that it is frequent to have a lots of documents that are pretty static and a few ones that are frequently updated, and these frequently updated documents would more likely be in the last segments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll switch it to go backwards again.

@mikemccand
Copy link
Contributor Author

bulkindexing

I ran a bulk indexing perf test comparing master vs this patch, updating 10M small log-entry type docs ("index" command, passing _id), with random UUIDs (%10d, worst case for terms dict), using MMapDir, and the results look promising: reusing the TermsEnum gets back much of the performance that bloom filters buy us today.

However, this is a small test (10M docs), the index was fully hot...

@@ -67,7 +67,9 @@
for (String luceneName : PostingsFormat.availablePostingsFormats()) {
buildInPostingFormatsX.put(luceneName, new PreBuiltPostingsFormatProvider.Factory(PostingsFormat.forName(luceneName)));
}
final Elasticsearch090PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat();
// nocommit can we disable bloom by default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me given the minor speedup they bring. But maybe as a separate change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for separate issue, I'll open it.

I want to test a larger index, and a cold index first. Separately I'm going to test switching to the new IDVPostingsFormat.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #6349

@mikemccand
Copy link
Contributor Author

OK I went back to searching segments in reverse order, and downgraded the other nocommits to TODOs. I think this is ready; I'll commit soon.

@@ -67,7 +67,8 @@
for (String luceneName : PostingsFormat.availablePostingsFormats()) {
buildInPostingFormatsX.put(luceneName, new PreBuiltPostingsFormatProvider.Factory(PostingsFormat.forName(luceneName)));
}
final Elasticsearch090PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat();
final PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat();
//final PostingsFormat defaultFormat = PostingsFormat.forName("Lucene41");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just remove this commented line before pushing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK will do!

@mikemccand
Copy link
Contributor Author

Committed with #6212

@mikemccand mikemccand closed this May 31, 2014
@clintongormley clintongormley changed the title Core: reuse Lucene's TermsEnum for faster _uid/version lookup during indexing Indexing: Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing Jul 16, 2014
mikemccand added a commit to mikemccand/elasticsearch that referenced this pull request Jul 22, 2014
This commit changes the default for index.codec.bloom.load to false,
because bloom filters can use a sizable amount of RAM on indices with
many tiny documents, and now only gives smallish index-time
performance gains for apps that update (not just append) documents,
since we've separately improved performance for ID lookups with
elastic#6298.

Closes elastic#6349
mikemccand added a commit that referenced this pull request Jul 23, 2014
This change just changes the default for index.codec.bloom.load to
false: with recent performance improvements to ID lookup, such as
#6298, bloom filters don't give much of a performance gain anymore,
and they can consume non-trivial RAM when there are many tiny
documents.

For now, we still index the bloom filters, so if a given app wants
them back, it can just update the index.codec.bloom.load to true.

Closes #6959
mikemccand added a commit that referenced this pull request Jul 23, 2014
This change just changes the default for index.codec.bloom.load to
false: with recent performance improvements to ID lookup, such as
#6298, bloom filters don't give much of a performance gain anymore,
and they can consume non-trivial RAM when there are many tiny
documents.

For now, we still index the bloom filters, so if a given app wants
them back, it can just update the index.codec.bloom.load to true.

Closes #6959
@clintongormley clintongormley added the :Core/Infra/Core Core issues without another label label Jun 7, 2015
@clintongormley clintongormley changed the title Indexing: Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing Jun 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Indexing: Versions.loadDocIdAndVersion should reuse TermsEnums
4 participants