Israeli Association of Human Language Technologies

IAHLT is a nonprofit organization whose goal is to create annotated datasets for Hebrew and Arabic. Several of our resources are open source; the rest are available to fee-paying members only.

Open resources

Each release comes with scripts and pretrained models. For details, see the README in each repository.

| Resource | Type | Size |
| --- | --- | --- |
| UD_Hebrew-IAHLTwiki | treebank + named entities, conllu + biose | 5k sentences |
| UD_Hebrew-IAHLTknesset | treebank, conllu | 2.5k sentences |
| Arabic-Morphology-IAHLT | morphology, conllu | 2k sentences |
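
The open resources above are distributed as conllu files. As a rough illustration of how such a file can be read without any dependencies, here is a minimal sketch that assumes the standard ten-column, tab-separated CoNLL-U layout. The function names and the toy English sample are hypothetical and are not taken from the repositories, whose own scripts should be preferred.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comment such as "# text = ..."; skipped here.
            continue
        else:
            cols = line.split("\t")
            if len(cols) == len(FIELDS):
                sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # Handle a final sentence with no trailing blank line.
        yield sentence


def syntactic_words(sentence: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop multiword-token ranges (ids like "1-2") and empty nodes ("1.1")."""
    return [t for t in sentence if t["id"].isdigit()]


# Invented toy sample, only to exercise the reader.
SAMPLE = (
    "# sent_id = toy-1\n"
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
lemmas = [tok["lemma"] for tok in syntactic_words(sentences[0])]
print(len(sentences), lemmas)
```

Filtering on integer ids matters for Hebrew and Arabic in particular, where a single surface token is often split into several syntactic words and the surface token appears as an id range.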

Member resources

Please contact info@iahlt.org to inquire about access.

| Resource | Text type | Quantity |
| --- | --- | --- |
| Hebrew Universal Dependencies | diverse | 48,000 sentences |
| Hebrew Named Entities | diverse | 47,000 paragraphs |
| Arabic Named Entities | newspaper | 75,000 paragraphs |
| Arabic Morphology | newspaper | 5,973 sentences |
| Arabic transcriptions - Palestinian dialects | diverse | 93,000 tokens |

We also have UD and NER parsers trained over these resources.

Hebrew Universal Dependencies

Our Hebrew treebank consists of over 45k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.

| Source | Genres | No. of sentences | No. of documents |
| --- | --- | --- | --- |
| Davar | Hebrew news | 11224 | 369 |
| GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4213 | 0 |
| Israel Hayom | Hebrew news | 386 | 7 |
| Knesset | Knesset protocols | 2883 | 100 |
| Kol Zchut | Information about the entitlements and rights of Israelis | 11214 | 313 |
| Wikipedia | Biographies, events, locations, legal, medical | 11309 | 78 |

Over 6k of the 39k total sentences have two independent annotations, and kappa inter-annotator agreement scores are provided per feature.
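
For readers unfamiliar with kappa scores, the following is a minimal sketch of Cohen's kappa for two annotators labelling the same items. It assumes the two-annotator Cohen variant, which the text above does not confirm; the function name and the toy labels are hypothetical, and the released agreement figures may have been computed differently.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two equal-length label sequences (illustrative)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty items")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example: one feature (here a POS tag) compared across four tokens.
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 4))
```

A per-feature score would apply the same computation separately to each annotated column, for example once for part-of-speech tags and once for dependency relations.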

Arabic Morphology

Our Arabic lemmatisation and part-of-speech tagging corpus consists of 5973 annotations of 4869 sentences, each annotated for lemma and part of speech, with a total of 171722 annotated tokens and 12190 unique lemmas. The texts were sampled from news articles. For sentences annotated more than once, inter-annotator agreement statistics are included.

Arabic and Hebrew Named Entities

Our named entities corpus includes over 123k paragraphs from over 11k documents in Arabic and Hebrew, annotated for 13 entity types.

| Language | Source | No. of paragraphs | No. of articles |
| --- | --- | --- | --- |
| Hebrew | Davar, Israel Hayom, Knesset protocols, and Wikipedia | 47k | 2923 |
| Arabic | News | 75k | 8342 |
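
The open UD_Hebrew-IAHLTwiki release lists its named entities in biose format. As a sketch of how such tags can be turned into entity spans, the code below assumes the common `B-`/`I-`/`O`/`S-`/`E-` prefix scheme with a hyphen-separated label. The function name and the labels `PER`, `LOC`, and `ORG` are illustrative assumptions, since the 13 entity types are not listed here and the member corpus may use a different layout.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start index, exclusive end index, label)


def biose_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIOSE tag sequence into entity spans (illustrative)."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            spans.append((i, i + 1, kind))
            start, label = None, None
        elif prefix == "B":
            # Open a new multi-token entity.
            start, label = i, kind
        elif prefix == "I" and start is not None and kind == label:
            # Continue the open entity.
            continue
        elif prefix == "E" and start is not None and kind == label:
            # Close the open entity.
            spans.append((start, i + 1, kind))
            start, label = None, None
        else:
            # "O" or an ill-formed sequence: discard any open entity.
            start, label = None, None
    return spans


# Toy tag sequence with invented labels.
tags = ["B-PER", "E-PER", "O", "S-LOC", "B-ORG", "I-ORG", "E-ORG"]
spans = biose_to_spans(tags)
print(spans)
```

Ill-formed sequences, such as an `I-` tag with no preceding `B-`, are silently dropped in this sketch; a stricter reader might raise an error instead.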

Palestinian Arabic Transcriptions

Our Palestinian Arabic corpus consists of 90k tokens of transcribed text from 32 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

| Year | Videos | Duration (min) | No. of tokens |
| --- | --- | --- | --- |
| 2010-2014 | 4 | 18 | 2131 |
| 2015-2019 | 19 | 484 | 37026 |
| 2020- | 11 | 686 | 47129 |