
Try turning off patching in Lucene's PFOR encoding #46

Closed
mikemccand opened this issue Aug 15, 2023 · 10 comments
@mikemccand
Collaborator

One difference between Lucene and Tantivy is that Lucene uses "patched" FOR: the largest values in a block are held out as exceptions so that the remaining values can be encoded with fewer bits per value, trading CPU for lower storage space.

Let's try temporarily disabling the patching in Lucene, to match how Tantivy encodes, and see how much performance the patching is costing us.
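To make the tradeoff concrete, here is a minimal sketch (not Lucene's actual ForUtil; the class and method names are just for illustration) of why patching saves space: plain FOR must size the whole block for its largest value, while patched FOR holds the largest values out as exceptions and sizes the block for the rest.

```java
import java.util.Arrays;

public class PforSketch {
    // Bits needed to represent v (at least 1, so all-zero blocks still encode).
    public static int bitsRequired(long v) {
        return v == 0 ? 1 : 64 - Long.numberOfLeadingZeros(v);
    }

    // Plain FOR: every value in the block pays for the widest value.
    public static int forBitsPerValue(long[] block) {
        int bits = 1;
        for (long v : block) bits = Math.max(bits, bitsRequired(v));
        return bits;
    }

    // Patched FOR: ignore the top maxExceptions values when picking the bit
    // width; those values would be stored separately as patch exceptions.
    public static int pforBitsPerValue(long[] block, int maxExceptions) {
        long[] sorted = block.clone();
        Arrays.sort(sorted);
        int kept = Math.max(1, sorted.length - maxExceptions);
        return forBitsPerValue(Arrays.copyOf(sorted, kept));
    }

    public static void main(String[] args) {
        long[] block = {1, 2, 3, 1000};
        // One outlier (1000) forces 10 bits/value under plain FOR; holding it
        // out as an exception lets the rest of the block use 2 bits/value.
        System.out.println("FOR bits/value:  " + forBitsPerValue(block));
        System.out.println("PFOR bits/value: " + pforBitsPerValue(block, 1));
    }
}
```

Turning patching off means every block is encoded the plain-FOR way, which is exactly the storage-for-CPU tradeoff being measured in this issue.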

@Tony-X
Owner

Tony-X commented Aug 15, 2023

++

Is there an easy way to disable it? Based on my code reading, it does not seem to be configurable.

@slow-J slow-J self-assigned this Aug 23, 2023
@slow-J
Collaborator

slow-J commented Aug 25, 2023

I forked a test PostingsFormat, PostingsReader, PostingsWriter, codec, and ForUtil to skip the patching for exceptions (large values).

I've done a quick test; I want to re-index and confirm the numbers.
The latency difference for Lucene is: -2% in COUNT, -2% in TOP_10_COUNT, -2.07% in TOP_100.
The improvement is largest for AndHighHigh and HighSloppyPhrase queries; MedTerm and LowTerm show regressions.

I ran a benchmark (m6g.4xlarge) with COUNT, TOP_10_COUNT, TOP_100 against Lucene with PFOR (from the previous test after d1b928c) and without the patching in the PFOR. See the test code here: slow-J/lucene@cd68926 (please let me know if there are any improvements I could have made).

To build the code for the benchmark, I ran gradlew jar, copied the core jar to my benchmark workspace, and added this to build.gradle: implementation files("./libs/lucene-core-9.7.1-SNAPSHOT.jar"). I also added config.setCodec(new LuceneTestCodec(LuceneTestCodec.Mode.BEST_COMPRESSION)); to BuildIndex.java.

Attaching results:
results.json.zip

(three benchmark-result charts attached as images)

@Tony-X
Owner

Tony-X commented Sep 6, 2023

Did you get a chance to check the index size impact?

@mikemccand
Collaborator Author

I also wonder whether removal of patching is more or less impactful on Graviton3?

@slow-J
Collaborator

slow-J commented Sep 7, 2023

Re-indexed again to make sure I had the correct version built.

with patching turned on (baseline):
10.098 GiB idx

with patching turned off:
10.605 GiB idx

So turning off patching causes a +5.0208% increase in the size of the index.

I'll test Graviton3 when I get a chance.

@slow-J
Collaborator

slow-J commented Sep 7, 2023

Variables

  • Graviton3
  • JDK 17
  • Same test code as before

Candidate:
COUNT, avg: 11,108 μs
TOP_10_COUNT, avg: 18,652 μs
TOP_100, avg: 14,810 μs

Comparing to baseline results from #36

Baseline:
COUNT, avg: 10,883 μs
TOP_10_COUNT, avg: 19,539 μs
TOP_100, avg: not tested

Changes with turning off the patching in PFOR encoding:
COUNT, avg: +2.06744% latency
TOP_10_COUNT, avg: -4.53964% latency

So the improvement to TOP_10_COUNT roughly doubles compared with Graviton2, while COUNT has regressed.
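For reference, the percentages above follow from the averages as a plain relative delta, (candidate − baseline) / baseline. A quick check (the DeltaCheck class name is just for illustration):

```java
public class DeltaCheck {
    // Relative latency change in percent: positive means the candidate is slower.
    public static double pctChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Graviton3 averages in microseconds; baseline numbers are from #36.
        System.out.printf("COUNT: %+.5f%%%n", pctChange(10_883, 11_108));
        System.out.printf("TOP_10_COUNT: %+.5f%%%n", pctChange(19_539, 18_652));
    }
}
```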

Attaching results.json.
results.json.zip

@mikemccand
Collaborator Author

TOP_10_COUNT, avg: -4.53964% latency

This is quite a compelling gain.

Did you turn off patching for both doc and freq? It's odd that COUNT regressed, but it is (should be!) only decoding doc in each postings list.

@slow-J
Collaborator

slow-J commented Sep 12, 2023

I turned off all patching in postings (both doc and freq, I believe); see slow-J/lucene@cd68926

I will re-run the Graviton3 benchmark to double-check whether the COUNT regression stands.

@slow-J
Collaborator

slow-J commented Sep 18, 2023

Ran the Graviton 3 benchmark again.

COUNT, avg: 10,822 μs
TOP_10_COUNT, avg: 19,039 μs
TOP_100, avg: 14,914 μs

So this time, compared to the control, we have:
COUNT: -0.56%
TOP_10_COUNT: -2.559%

So there is some variance between benchmark runs, and no COUNT regression this time.

@slow-J
Collaborator

slow-J commented Nov 6, 2023

The Lucene PR: apache/lucene#12741 has been merged! Resolving!

@slow-J slow-J closed this as completed Nov 6, 2023