lucene-stanford-lemmatizer

This is a library that adds NLP capabilities to Lucene-based search engines: lemmatization and filtering based on part-of-speech (POS) tags. It uses the state-of-the-art Stanford POS Tagger for NLP support.

Lemmatizing is similar to stemming, except smarter: it takes into account the context of a word to determine the correct lemma/stem. POS filtering is a smarter replacement for stop lists. It allows filtering out all pronouns, adverbs, etc.
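
As an illustration of the POS-filtering idea only, here is a minimal, self-contained sketch. It is not this library's implementation: the class PosFilterSketch and its helpers are hypothetical names, the input is hand-tagged rather than produced by the Stanford POS Tagger, and the tags are assumed to follow the Penn Treebank tag set (PRP, PRP$, WP, WP$ for pronouns; RB, RBR, RBS, WRB for adverbs).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of lucene-stanford-lemmatizer.
public class PosFilterSketch {

    /** A word paired with its Penn Treebank POS tag, as a tagger would emit it. */
    static final class TaggedToken {
        final String word;
        final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** True for pronoun tags (PRP, PRP$, WP, WP$) and adverb tags (RB, RBR, RBS, WRB). */
    static boolean isFilteredTag(String tag) {
        return tag.startsWith("PRP") || tag.startsWith("WP")
            || tag.startsWith("RB") || tag.equals("WRB");
    }

    /** Keeps only the words whose tag is not a pronoun or adverb tag. */
    static List<String> keepContentWords(List<TaggedToken> tokens) {
        List<String> kept = new ArrayList<>();
        for (TaggedToken t : tokens) {
            if (!isFilteredTag(t.tag)) {
                kept.add(t.word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hand-tagged input; a real pipeline would get these tags from the POS tagger.
        List<TaggedToken> sentence = Arrays.asList(
            new TaggedToken("she", "PRP"),
            new TaggedToken("quickly", "RB"),
            new TaggedToken("indexed", "VBD"),
            new TaggedToken("the", "DT"),
            new TaggedToken("documents", "NNS"));
        System.out.println(keepContentWords(sentence));
    }
}
```

Unlike a stop list, the filter never looks at the word itself, so any pronoun or adverb is dropped without having to be enumerated in advance. Other categories, such as the determiner in the sample sentence, pass through unless their tags are added to the check.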

For lemmatization and POS tagging to work best, your queries should be English sentences instead of just bunches of keywords.

Getting started

Download this package and

Set your CLASSPATH to include the above, then issue ant jar.

In your search code, construct an EnglishLemmaAnalyzer instead of a StandardAnalyzer (or whatever you normally use). Pass the filename of a Stanford POS Tagger model file to the constructor; model files are found in the models/ directory of the Stanford POS Tagger source directory.

Going further

You can control which parts of speech are indexed by subclassing the tokenizer. See the API docs for details.

Bugs

Lucene 4.x support is missing. Please don't email me (Lars) about this; I don't have the time to learn the new APIs and fix it. If you know a fix, please fork this project and publish your changes.

The implementation is limited to English, because the Stanford lemmatizer only handles that language. The POS tagger also handles Chinese and German, so it should be possible to add those languages.

