GitHub - sauparna/Lucene: A mod of Lucene-5.4.0 for processing TREC data.

sauparna / Lucene Public

forked from isoboroff/trec-demo

Notifications You must be signed in to change notification settings
Fork 0
Star 0

A mod of Lucene-5.4.0 for processing TREC data.

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 142 Commits
example		example
src/main/java		src/main/java
NOTES.txt		NOTES.txt
README.txt		README.txt
models		models
pom.xml		pom.xml

Repository files navigation

Lucene Mod

Sauparna Palchowdhury
sauparna.palc [at] gmail [dot] com

Henry Feild
hfeild [at] endicott [dot] edu

----------------------------------------------------------------------
DESCRIPTION

This is Lucene (5.4.0) with some modifications for processing TREC
test-collections. The way to use it is to pass to 'IndexTREC', at the
command-line, a TREC document corpus to index, and then retrieve
documents with 'BatchSearch' and a set of queries as input.

To run the commands described below you will need the sample TREC data: http://sauparna.sdf.org/Search/.files/ap.tgz

----------------------------------------------------------------------
COMPILING

Type "mvn package" in a shell. Tested with Maven 3.0.5 and 3.5.0.

----------------------------------------------------------------------
INDEXING

java -cp "/x/LTR/lib/*" IndexTREC -settings settings.hjson \
-index AP \
-docs ap/AP \
-stop ap/ser17.txt \
-stem PorterStemFilter

(Additional settings exist; any setting that is valid in the settings file
can be provided on the command line with a '-' prefix.)

AP - The string passed as -index is a directory where Lucene will
write the index.

ap/AP - This is a directory. In the sample test-collection ap.txt is
the only file in the corpus and it has been placed inside a directory
named 'AP' because Lucene expects a path to a directory to look for a
corpus in.

----------------------------------------------------------------------
RETRIEVAL

java -cp "/x/LTR/lib/*" BatchSearch -settings settings.hjson \
-index AP \
-queries ap/query-l.txt \
-similarity BM25Similarity \
-stop ap/ser17.txt \
-stem PorterStemFilter

(Additional settings exist; any setting that is valid in the settings file
can be provided on the command line with a '-' prefix.)

ap/query-l.txt - A plain text file containing formatted TREC
queries. Each query is enclosed in a <TOP> tag and the text is placed
within a <TEXT> tag. It was necessary to normalize the formatting
because the older (early 1990's) TREC queries used a different
structure and building this intelligence into Lucene would require
more work.

The query is pre-processed to this format:

<TOP>
<NUM>301</NUM>
<TEXT>
hello world
</TEXT>
<TOP>

Section E, 'PRE-PROCESSING TREC QUERIES', in TRECBOX's documentation
shows how to covert TREC queries to this format:

http://kak.tx0.org/IR/.trecbox/README.txt

----------------------------------------------------------------------
SETTINGS FILE

All settings can be provided in a settings file following HJSON, which is
a more human-friendly version of JSON. See example/settings.hjson for an
example. The available settings are listed below.

Note: all paths can be absolute or relative to where LTR will be invoked.

Indexing + retrieval options:

indexPath -- The path to where the index is located (when querying) or
placed (when indexing).

stopFile -- The path to the stop word list file to use during indexing
or retrieval. Use "None" if no stopping should be performed
(default).
tokenizer -- One of two preset options, or a fully qualified class:
WhitespaceTokenizer (default) -- delimits tokens based
on whitespace; this is an alias for
org.apache.lucene.analysis.core.WhitespaceTokenizer
ClassicTokenizer -- delmits tokens based on whitespace
and punctuation (which is removed, with some
exceptions). This is an alias for
org.apache.lucene.analysis.standard.ClassicTokenizer
... -- a fully qualified Tokenizer class. This must have
a default constructor.

stemmer -- The name of the stemmer to use during indexing or retrieval.
See NOTES.txt for a list of available stemmers. Set to
"None" to turn stemming off (default).

Indexing only options:

docsPath -- Indexing only. The path to the directory containing the
corpus to index. This may contain uncompressed or compressed
(gzip or bzip2) files. Extensions will determine the parser
to use. See NOTES.txt for more information about document
formats.

storeFields -- If set to false (default), fields other than docno will be
indexed, but not stored. Set to true in order to have the
option of including snippets in retrieval results.

warcFieldsToIndex
-- A list of fields to index from WARC documents. Use
"contents" to specify all document text (excluding tags).
If the list is empty, "contents" is assumed.

trecFieldsToIndex
-- Similar to warcFieldsToIndex, but for TREC text and web
documents.

memory -- The amount of memory (in MiB) to use for the indexing.
Defaults to 4096.

Retrieval only options:

searchField -- The field to search. Defaults to "contents".

similarity -- The retrieval model to use. See NOTES.txt for a full
listing of available retrieval models.

queryFile -- The path to the file containing all the queries to run in
batch search. See RETRIEVAL above for a description of
the format of this file.

returnedResultCount
-- Defaults to 1000. Set this to the maximum number of
documents that should be returned for a query.

includeSnippets
-- Default is false. If true, snippets will be included
with results. In order to use this, the index must have
been indexed with the storeFields option set to true.

maxSnippetFragments
-- The number of sentence fragments to include in the snippets.
Defaults to 4.

----------------------------------------------------------------------
EXAMPLES

The examples/ directory contains an example setup, including:

corpus/ -- A directory containing two simple example documents,
one each in the TREC and WARC formats.
queries.txt -- An example query file.
settings.hjson -- A settings file for indexing and retrieval.
stop.txt -- An example stop file (don't use this for anything real;
it only contains three stop words).

To index the example, do:

cd example
java -cp "../lib/*" IndexTREC -settings settings.hjson

This will create a directory called index/, which contains the Lucene index. If
you would like to see statistics about the index, check out the CLue command
line tool on GitHub (https://github.com/javasoze/clue).

To run the example retrieval, do the following from the example directory:

java -cp "../lib/*" BatchSearch -settings settings.hjson

Try playing with the values in the settings file, or overriding values on the
command line to see how things work.