synhershko edited this page Dec 18, 2012 · 4 revisions

What is HebMorph?

Indexing Hebrew texts for later retrieval is not a trivial task. Although several solutions exist, they do not necessarily provide the best results in terms of relevancy. Either way, there is no freely available solution that can index Hebrew at even the most basic level.

HebMorph was started with this in mind. It is a free, open-source effort to make Hebrew properly searchable by various IR software libraries while maintaining decent recall, precision and relevancy in retrieval. Over the course of this project, we will try to come up with different approaches to indexing Hebrew and provide the tools to compare them reliably. The project’s ultimate goal is to provide various IR libraries with the best Hebrew IR capabilities possible.

Apache Lucene has been selected as our planning and testing framework, thanks to its advanced capabilities, its flexibility, and the author’s familiarity with it. During these initial steps, .NET code is being written and used with Lucene.Net (a .NET port of Java Lucene). Once the project stabilizes enough, ports to other languages will follow.

More detailed information on why this project is important can be found in a series of three posts on my blog: Challenges with indexing Hebrew texts, Finding Hebrew lemmas, and Open-source Hebrew information retrieval. The project’s roadmap is in the last post.

More updates can be found on the project’s page: http://www.code972.com/blog/hebmorph/

Think-tank mailing list for discussion and planning: https://lists.sourceforge.net/lists/listinfo/hebmorph-thinktank

License

HebMorph is copyright © 2010, Itamar Syn-Hershko.

HebMorph relies on Hspell, copyright © 2000-2010, Nadav Har’El and Dan Kenigsberg (http://hspell.ivrix.org.il/).

All code, data and dictionary files are released to the public under the GNU Affero General Public License (AGPL) version 3. See the COPYING file included in this distribution for the full text of the license.

Note that not only the programs in the distribution, but also the dictionary files and the generated word lists, are licensed under the same license.

There is no warranty of any kind for the contents of this distribution.

If you are interested in using this product commercially or without releasing your source code, please contact the author.
