msmarco-v2-passage-full

Lucene index of the MS MARCO V2 passage corpus.

This index was generated on 2022/01/11 at Anserini commit 06fb4f on orca with the following command:

target/appassembler/bin/IndexCollection -collection MsMarcoV2PassageCollection \
  -generator DefaultLuceneDocumentGenerator -threads 18 \
  -input /store/collections/msmarco/msmarco_v2_passage/ \
  -index indexes/lucene-index.msmarco-v2-passage-full.20220111.06fb4f/ \
  -storePositions -storeDocvectors -storeRaw -optimize

Note that there are three variants of this index:

  • msmarco-v2-passage (45G uncompressed): the "default" version, which stores term frequencies and the raw text. This supports bag-of-words queries, but not phrase queries (no term positions are stored) or relevance feedback (no document vectors are stored).
  • msmarco-v2-passage-slim (11G uncompressed): the "slim" version, which stores term frequencies only. This supports bag-of-words queries, but not phrase queries or relevance feedback, and the raw text cannot be fetched from this index.
  • msmarco-v2-passage-full (69G uncompressed): the "full" version, which stores term frequencies, term positions, document vectors, and the raw text. This supports bag-of-words queries, phrase queries, and relevance feedback.

This is the "full" version.
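
The relationship between the three variants and the storage flags in the indexing command can be sketched as below. This is an illustration only: the helper `flags_for_variant` is hypothetical, and the flag sets for the "default" and "slim" variants are inferred from the descriptions above rather than taken from the commands that actually built those indexes. The script only prints the flags; it does not run `IndexCollection`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the storage flags assumed for each index variant.
# Only the "full" flag set is confirmed by the command in this README; the
# other two are assumptions derived from what each variant is said to store.
flags_for_variant() {
  case "$1" in
    slim)    echo "" ;;                                           # term frequencies only
    default) echo "-storeRaw" ;;                                  # adds raw text
    full)    echo "-storePositions -storeDocvectors -storeRaw" ;; # adds positions and document vectors
    *)       echo "unknown variant: $1" >&2; return 1 ;;
  esac
}

# Show the assumed flags for every variant without building anything.
for variant in slim default full; do
  echo "${variant}: $(flags_for_variant "${variant}")"
done
```

Under these assumptions, dropping `-storePositions` removes phrase-query support, dropping `-storeDocvectors` removes relevance feedback, and dropping `-storeRaw` removes access to the raw text.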