Home
Important: This experiment is currently not maintained at GitHub. New code can be found at https://issues.apache.org/jira/browse/LUCENE-2369
This is a forked repository, created for experimentation with low memory overhead sorting, faceting and index lookup. See Apache’s Lucene wiki for information on the main repository.
The idea is to expose the inner ordinals for terms in the SegmentReader, trading String access speed for a lower memory footprint. A detailed blog post on this can be found at sbdevel.wordpress.com and a JIRA issue at LUCENE-2369 (formerly 2335).
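To illustrate why ordinals are enough for sorting, here is a minimal sketch in plain Java. It does not use the Lucene API or the code from this fork; the class name, method name and sample data are hypothetical. It assumes each document has exactly one sort term and that an ordinal is the position of that term in the sorted list of unique terms, so comparing two documents only needs an integer comparison.

```java
import java.util.Arrays;

public class OrdinalSortSketch {
    /**
     * Returns document IDs ordered by the ordinal of each document's sort term.
     * docToOrdinal[doc] is assumed to be the term's position in sorted term order.
     */
    static Integer[] sortDocsByOrdinal(int[] docToOrdinal) {
        Integer[] docIds = new Integer[docToOrdinal.length];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = i;
        }
        // Only integers are compared; no term Strings are loaded while sorting.
        Arrays.sort(docIds, (Integer a, Integer b) ->
                Integer.compare(docToOrdinal[a], docToOrdinal[b]));
        return docIds;
    }

    public static void main(String[] args) {
        // Hypothetical data: the term array is kept here only to print results.
        String[] sortedTerms = {"apple", "banana", "cherry"};
        int[] docToOrdinal = {2, 0, 1, 0};
        for (int doc : sortDocsByOrdinal(docToOrdinal)) {
            System.out.println("doc " + doc + " -> " + sortedTerms[docToOrdinal[doc]]);
        }
    }
}
```

In this sketch the Strings are only resolved when a result is displayed, which is the slower step the approach accepts in exchange for not holding all terms in memory.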
Currently startup time is about 6 minutes for 10 million terms on a laptop with an i7 processor, using a conventional hard disk. On the plus side, once initialized, sorts are significantly faster than Lucene’s default and the memory overhead for the structure is (#sortterms*log2(#sortterms) + #documents*log2(#sortterms) + #documents*log2(#terms)) / 8 bytes. For an index of 20 million documents and 10 million unique terms in the sort field plus 100 million unique terms in total, this is 90 MB. A temporary overhead of #documents*8 bytes is needed for building the structures.
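The overhead formula can be evaluated for other index sizes with a small helper. The following is a sketch only: the class and method names are hypothetical, and it assumes unrounded base-2 logarithms, whereas a real packed structure would likely round bits per value up to a whole number.

```java
public class OverheadEstimate {
    /** Base-2 logarithm, left unrounded (an assumption of this sketch). */
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /**
     * Permanent overhead in bytes:
     * (#sortterms*log2(#sortterms) + #documents*log2(#sortterms)
     *  + #documents*log2(#terms)) / 8
     */
    static double structureBytes(long sortTerms, long documents, long terms) {
        double bits = sortTerms * log2(sortTerms)
                + documents * log2(sortTerms)
                + documents * log2(terms);
        return bits / 8.0;
    }

    /** Temporary overhead while building the structures: #documents*8 bytes. */
    static long temporaryBytes(long documents) {
        return documents * 8L;
    }

    public static void main(String[] args) {
        long sortTerms = 10_000_000L;   // unique terms in the sort field
        long documents = 20_000_000L;   // documents in the index
        long terms = 100_000_000L;      // unique terms in total
        System.out.printf("structure: %.1f MB%n",
                structureBytes(sortTerms, documents, terms) / 1e6);
        System.out.printf("temporary: %.1f MB%n",
                temporaryBytes(documents) / 1e6);
    }
}
```

Evaluated this way, the formula gives roughly 154 MB for 20 million documents, 10 million sort terms and 100 million terms in total. With 10 million documents it gives roughly 91 MB, which is closer to the 90 MB figure quoted, so the inputs behind that figure may differ from the ones listed. The temporary overhead for 20 million documents is 160 MB.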
Test it by running
java -cp lucene-core-3.1-dev-LUCENE-2335-20100405.jar org.apache.lucene.index.ExposedPOC