Skip to content

PointerFLY/Lucene-Example

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Lucene-Example

A simple document search engine powered by Lucene.

How to run

Gradle is used thoroughly in this project. Check out build.gradle by yourself if you have knowledge on Gradle.
I recommend open it with IDE who supports Gradle, such as Intellij IDEA.

# Make sure gradle is installed.
git clone https://github.com/PointerFLY/Lucene-Example.git
cd Lucene-Example
gradle run

Class Structure

Class Functionality
FileUtils Create local folder structures, automatically download and decompress Cranfield files. Provide access to all necessary files (Cranfield files, index files).
FileParser A parser used to parse and load Cranfield files into a Java supported data structre.
DocumentModel A simple object-oriented way to represent Cranfield documents.
Indexer Responsible to create all indices.
Searcher Read indices created by Indexer, parse the query string and exectute search. A search will return top hits documents ids.
CustomAnalyzer A analyzer created by myself to compare its performances with other built-in Analyzer in Lucene.
Evaluator Calculate Mean Average Precision, Recall.
Main Entry of program, run different analyzers and scoring models and print results.

Analyzer

  • StandardAnalyzer
    Built-in Analyzer in Lucene, the most robust one. Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

  • CustomAnalyzer I select differnt tokenizer and filters, combine them, fine-tune the pipeline. Finally choose a analysis pipeline below:

public class CustomAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new ClassicTokenizer();
        TokenStream tokenStream = new LowerCaseFilter(source);
        tokenStream = new EnglishPossessiveFilter(tokenStream);
        tokenStream = new EnglishMinimalStemFilter(tokenStream);
        tokenStream = new KStemFilter(tokenStream);
        tokenStream = new PorterStemFilter(tokenStream);
        CharArraySet stopSet = CharArraySet.copy(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        tokenStream = new StopFilter(tokenStream, stopSet);
        return new TokenStreamComponents(source, tokenStream);
    }
}

Scoring Model

I choose two built-in Lucene scoring model, which is subclasses of Similarity in Lucene.

  • ClassicSimilarity
    A implementation of TFIDFSimilarity. It is actually a Vector Sapce Model base on TF-IDF term weighting.

  • BM25Similarity
    A TF-IDF-like ranking function used in document retrieval, which is generally considered superior to TF-IDF.

Performance

Note: MAP and recall are average of 225 queries.

N/A StandardAnalyzer CustomAnalyzer
Vector Space Model MAP: 0.4905 Recall: 0.6445 MAP: 0.5038 Recall: 0.6839
BM25 Model MAP: 0.5418 Recall: 0.6826 MAP: 0.5435 Recall: 0.7233

Conclusion:

  • BM25 Model outperforms Vector Space Model.
  • CustomAnalyzer outperforms StandardAnalzyer.

About

A simple document search engine powered by Lucene.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages