A comprehensive list of Hebrew NLP resources.
Latest commit d3e2929 Aug 7, 2019


Hebrew NLP Resources

This repository collects resources for NLP in Hebrew, as part of the NLPH project, which you can read more about here. Resources are divided into folders by type. If you have a resource to contribute that can be released under an open license, please submit a pull request or contact us at contact@nlph.org.il. See here for a list of companies operating in the field.

This document lists Hebrew NLP resources, both for general use and as a reference when discussing which existing tools can be opened, adapted, or integrated to help create a good open-source foundation for NLP in Hebrew, as part of the NLPH Project.

When contributing to the list, please add a link to the license for all non-paper resources, e.g. {AGPL-3.0}, {?} for an unknown license, or {X} for unreleased/closed/copyrighted resources. For code resources, please also add the main language in which the tool is written, e.g. [Python], or [?] for an unknown programming language. Please add hosting mirrors in angle brackets, e.g. <Zenodo mirror>.
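The tagging convention above is regular enough to check mechanically. The following is a minimal sketch, assuming each entry is a single line of text; the function name `parse_entry_tags` is hypothetical and is not part of this repository.

```python
import re

# One pattern per annotation style described in the contribution notes.
LICENSE_RE = re.compile(r"\{([^{}]+)\}")     # e.g. {AGPL-3.0}, {?}, {X}
LANGUAGE_RE = re.compile(r"\[([^\[\]]+)\]")  # e.g. [Python], [?]
MIRROR_RE = re.compile(r"<([^<>]+)>")        # e.g. <Zenodo>


def parse_entry_tags(entry: str) -> dict:
    """Extract the license, language, and mirror tags from one list entry.

    Hypothetical helper: a missing license or language is reported as None.
    """
    licenses = LICENSE_RE.findall(entry)
    languages = LANGUAGE_RE.findall(entry)
    return {
        "license": licenses[0] if licenses else None,
        "language": languages[0] if languages else None,
        "mirrors": MIRROR_RE.findall(entry),
    }


# Abbreviated entries taken from this list.
print(parse_entry_tags("Knesset 2004-2005 {Public Domain} - A corpus. <MILA> <Zenodo>"))
print(parse_entry_tags("HebMorph [Lucene] {AGPL-3.0} - An open-source effort."))
```

An entry whose result has `license` set to `None` is missing its license tag and could be flagged before a pull request is submitted.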

1   Corpora

1.1   Structured Corpora

  • The MILA corpora collection {GPLv3} - The MILA center has 20 different corpora available for free for non-commercial use. All are available in plain text format, and most have tokenized, morphologically-analyzed, and morphologically-disambiguated versions available too.
  • Hebrew Wikipedia dumps {CC-BY-SA 3.0} - Wikipedia, the free encyclopedia, publishes dumps of its content as XML files on a monthly basis.
  • שתי שקל {?} - A Wikiproject for correcting grammar mistakes. (Heuristic) positive annotations can be derived from a query.
  • Hebrew Speech Databases (HSD) {?} - The HSD contains several Hebrew Speech Databases designed for the development and evaluation of Hebrew Speech Recognition Systems.
  • CoSIH - The Corpus of Spoken Hebrew {?} - The Corpus of Spoken Israeli Hebrew (CoSIH) is a database of recordings of spoken Israeli Hebrew.
  • hebrew corpus {?} - HebrewCorpus is a new corpus with 150 million words from NMELRC.
  • The Haifa Corpus of Spoken Hebrew {X} - A corpus of recorded spoken Hebrew and transcripts. Protected under rights reserved to Prof. Yael Maschler.
  • Eran Tomer's Digital Vocalized Text Corpus {Apache License 2.0} - A corpus of digital vocalized Hebrew texts created by Eran Tomer as part of his master's thesis. The corpus is found in the resources folder.
  • The SVLM Hebrew Wikipedia Corpus {CC-BY-SA 3.0} - A corpus of 50K sentences from Hebrew Wikipedia chosen to ensure phoneme coverage for the purpose of a sentence recording project.
  • Knesset 2004-2005 {Public Domain} - A corpus of transcriptions of Knesset (Israeli parliament) meetings between January 2004 and November 2005. Includes tokenized and morphologically tagged versions of most of the documents in the corpus. <MILA> <Zenodo>
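The Wikipedia dumps listed above are large XML files, so they are usually read as a stream rather than loaded whole. The following is a minimal sketch using only the Python standard library, assuming the usual MediaWiki export layout (`page` elements that contain a `title` and a `revision/text`). The function name `iter_pages` and the inline sample document are illustrative, and the namespace version in the sample may differ from the one in a given dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) pairs from a MediaWiki XML export stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the memory held by the processed page


# Tiny stand-in for a real dump file (an assumption, for illustration only).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>עברית</title>
    <revision><text>עברית היא שפה שמית.</text></revision>
  </page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(title, len(text))
```

For a real compressed dump, a binary file object such as the one returned by `bz2.open(path, "rb")` can be passed in place of the in-memory sample.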

1.2   Corpora Sources

  • JPress {Custom Terms of Use} - The National Library offers a collection of Jewish newspapers published in various countries, languages, and time periods, including digital versions and full-text search. The texts are published under a custom Terms of Use document that prohibits commercial use, and additionally requires checking the copyright status and receiving permission from the copyright-holder of the work for any use requiring such permission according to the Copyright Law.
  • DICTA {?} - Analytical tools for Jewish texts. They also have a GitHub organization.
  • Sefaria {Various} - A Living Library of Jewish Texts. 3,000 years of Jewish texts in Hebrew and English translation.
  • HaArchion {?} - Recordings of various Hebrew prose and poetry being read.
  • Project Ben Yehuda public dumps {Public Domain} - A repository containing dumps of thousands of public domain works in Hebrew, from Project Ben-Yehuda, in plaintext UTF-8 files, with and without diacritics (nikkud), and in HTML files.
  • ThinkIL {CC-BY-SA 3.0} - An archive of the writings of Zvi Yanai.
  • "Ha'Olam Ha'Ze" Newspaper Archive {?} - An online archive of issues of "Ha'Olam Ha'Ze" ("This World") Israeli newspaper.
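Some of the sources above, such as the Project Ben-Yehuda dumps, provide texts both with and without diacritics (nikkud). When only a vocalized text is at hand, an unvocalized version can be approximated by dropping the Hebrew combining marks. The following is a minimal sketch, assuming that every non-spacing mark in the range U+0591 to U+05C7 (nikkud and cantillation) should be removed; the function name `strip_nikkud` is hypothetical.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew combining marks (nikkud and cantillation) from text.

    Punctuation in the same range, such as maqaf (U+05BE), is kept because
    its Unicode category is not 'Mn' (non-spacing mark).
    """
    return "".join(
        ch
        for ch in text
        if not ("\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn")
    )


# The word "shalom" with nikkud: shin + shin dot + qamats, lamed, vav + holam, final mem.
vocalized = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
print(strip_nikkud(vocalized))
```

This only removes marks; it does not restore the fuller (ktiv male) spelling that unvocalized Hebrew often uses, so the result may differ from a published unvocalized edition.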

2   Linguistic Resources

2.1   Lexicons

  • The BGU morphological lexicon {?} - Is it released?
  • The morphological lexicon of the Israeli National Institute for Testing and Evaluation - Unreleased.
  • The MILA lexicon of Hebrew words {GPLv3} - The lexicon was designed mainly for usage by morphological analyzers, but is being constantly extended to facilitate other applications as well. The lexicon contains about 25,000 lexicon items and is extended regularly. Free for non-commercial use.
  • Hebrew WordNet {GPLv3} - Hebrew WordNet uses the MultiWordNet methodology and is aligned with the one developed at IRST (and therefore is aligned with English, Italian and Spanish). Free for non-commercial use.
  • MILA's Verb Complements Lexicon {GPLv3} - NLPH backup here.

2.2   Dictionaries & Word Lists

2.3   Treebanks

  • The Hebrew Treebank {GPLv3} - The Hebrew Treebank Version 2.0 contains 6500 hand-annotated sentences of news items from the MILA HaAretz Corpus, with full word segmentation and morpho-syntactic analysis. Morphological features that are not directly relevant for syntactic structures, like roots, templates and patterns, are not analyzed. This resource can be used freely for research purposes only.
  • UD Hebrew Treebank {CC BY-NC-SA 4.0} - The Hebrew Universal Dependencies Treebank.
  • Modern Hebrew Dependency Treebank v.1 {GPLv3} - This is the Modern Hebrew Dependency Treebank which was created and used in Yoav Goldberg's PhD thesis.

2.4   Embeddings

2.5   Other

3   Code

Also see here: https://github.com/iddoberger/awesome-hebrew-nlp

3.1   Tokenization

3.2   Morphological and Syntactic Analysis

3.3   Tagging Tools

3.4   Models

3.5   Other

  • Verb Inflector [Java] {Apache License 2.0} - A generation mechanism, created as part of Eran Tomer's (erantom@gmail.com) master's thesis, which produces vocalized and morphologically tagged Hebrew verbs given a non-vocalized verb in base form and an indication of which pattern the verb follows.
  • HebMorph [Lucene] {AGPL-3.0} - An open-source effort to make Hebrew properly searchable by various IR software libraries. Includes Hebrew Analyzer for Lucene.
  • Hebrew OCR with Nikud [Python] {?} - A program to convert Hebrew text files (without Nikud) to text files with the correct Nikud. Developed by Adi Oz and Vered Shani.
  • Text-Fabric [Python] {CC BY-NC 4.0} - A Python package for browsing and processing ancient corpora, focused on the Hebrew Bible Database.
  • Nakdan - Automatic Nikud for Hebrew texts.
  • The Automatic Hebrew Transcriber - Automatically transcribes text from Hebrew audio and video files.
  • word2word {Apache License 2.0} - Easy-to-use word-to-word translations for 3,564 language pairs. Hebrew is one of the 62 supported languages, and thus word-to-word translation to/from Hebrew is supported for 61 languages.

3.6   Commercial services

  • Eyfo - A commercial engine for search and entity tagging in Hebrew.
  • Melingo's ICA (Intelligent Content Analysis) - A text analysis and textual categorized entity extraction API for Hebrew, Arabic and Farsi texts.
  • Genius - Automatic analysis of free text in Hebrew.
  • AlmaReader - Online text-to-speech service for Hebrew.

4   Labs & Researchers

This list is meant to cover both researchers in the field of natural language processing, and in various related fields, including neurolinguistics and speech science. It also aims to cover researchers in both academia and industry.

4.1   Academia

4.2   Non-Profit

  • Allen Institute for AI - Israel
    • Prof. Yoav Goldberg
    • Dr. Jonathan Berant

4.3   Industry

Researching natural language processing in the industry? Open a pull request and add yourself here now!

5   Papers

5.1   Corpora, Lexicon and Dictionary Generation

5.2   Morphological Analysis & Disambiguation

5.3   Word Embeddings

5.4   Methodology

5.5   Other

6   Courses, Presentations and Meetups
