Full Stack of Latvian Language Resources for NLU and NLG
This repository contains a multilayer text corpus of Latvian. The broad application area we address is natural language understanding (NLU) and generation (NLG). The corpus is intended for developing a data-driven NLU and NLG toolchain for Latvian, as well as for use in linguistic studies. Both the multilayer corpus and the downstream applications are anchored in cross-lingual state-of-the-art representations: Universal Dependencies (UD), FrameNet, PropBank and Abstract Meaning Representation (AMR). Grammatical Framework (GF), a complementary representation, language resource and technology for NLG, is being developed separately (including the Latvian resource grammar).
The UD representation is automatically derived from a more elaborate, manually annotated hybrid dependency-constituency representation. The FrameNet annotations are manually added, guided by the underlying UD annotations. Consequently, frame elements are represented by the root nodes of the respective subtrees instead of text spans; the spans can be easily calculated from the subtrees. The PropBank layer is automatically derived from the FrameNet and UD annotations, given a manual mapping from FrameNet lexical units to PropBank frames and, for each such pair of FrameNet and PropBank frames, a mapping from FrameNet frame elements to PropBank semantic roles. Draft AMR graphs are derived from the PropBank and UD layers, as well as from auxiliary layers containing named entity and coreference annotations. The semantically rich FrameNet annotations are also helpful in acquiring more accurate AMR graphs.
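As an illustration of how a text span can be recovered from a frame element that is stored as a subtree root, here is a minimal sketch in Python. It is not the project's tooling: the function name `subtree_span` and the toy English sentence are assumptions made for this example, and the input is assumed to be a list of head ids as found in the HEAD column of a CoNLL-U file.

```python
def subtree_span(heads, root):
    """Return (first, last) 1-based token ids covered by the subtree of `root`.

    `heads[i]` is the head id of token i + 1; 0 marks the sentence root
    (the convention of the CoNLL-U HEAD column).
    """
    # Build a head -> dependents index once per sentence.
    children = {}
    for token_id, head in enumerate(heads, start=1):
        children.setdefault(head, []).append(token_id)

    # Iterative depth-first walk over all descendants of `root`.
    covered = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in covered:  # guards against malformed (cyclic) input
            continue
        covered.add(node)
        stack.extend(children.get(node, []))
    return min(covered), max(covered)


# Hypothetical toy sentence: "The old dog chased a cat"
#                  The old dog chased a  cat
example_heads = [3,  3,  4,  0,     6, 4]
print(subtree_span(example_heads, 3))  # (1, 3): "The old dog"
print(subtree_span(example_heads, 6))  # (5, 6): "a cat"
```

For a non-projective subtree, the returned range may also enclose tokens that do not belong to the subtree; in that case the set of covered token ids, rather than the (first, last) pair, would be the faithful representation.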
We aim at a medium-sized corpus: 10-15 thousand sentences annotated at all layers. Therefore it is important to ensure that the multilayer corpus is balanced not only in terms of text genres and writing styles but also in terms of lexical units. A fundamental design decision is that the text unit is an isolated paragraph. The multilayer corpus therefore consists of manually selected paragraphs from many different texts of various types. Representative paragraphs are selected in different proportions from a balanced 10-million-word text corpus: 60% news sources, 20% fiction, 10% legal texts, 5% spoken language, 5% miscellaneous.
As for lexical units, the goal is to cover the 1,000-2,000 most frequent verbs, as calculated from the 10-million-word corpus. Since the most frequent verbs tend to be the most polysemous, we expect the number of lexical units (verb senses w.r.t. semantic frames) to be considerably larger (2,000-4,000). We expect the corpus to be reasonably balanced w.r.t. nominal lexical units as well.
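The frequency ranking described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used for the corpus: the function name `top_verbs` is hypothetical, and the input is assumed to be a stream of (lemma, UPOS tag) pairs produced by some tagger, with verbs marked as `VERB`.

```python
from collections import Counter


def top_verbs(tagged_tokens, n):
    """Return the `n` most frequent verb lemmas, most frequent first.

    `tagged_tokens` is an iterable of (lemma, upos) pairs.
    """
    counts = Counter(lemma for lemma, upos in tagged_tokens if upos == "VERB")
    return [lemma for lemma, _ in counts.most_common(n)]


# Hypothetical toy input; a real run would stream the whole tagged corpus.
tokens = [
    ("teikt", "VERB"), ("iet", "VERB"), ("teikt", "VERB"), ("māja", "NOUN"),
    ("iet", "VERB"), ("teikt", "VERB"), ("dot", "VERB"),
]
print(top_verbs(tokens, 2))  # ['teikt', 'iet']
```

Counting lemmas rather than word forms matters for a highly inflected language such as Latvian, since a single verb surfaces in many forms.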
N. Gruzitis, L. Pretkalnina, B. Saulite, L. Rituma, G. Nespore-Berzkalne, A. Znotins, P. Paikens. Creation of a balanced state-of-the-art multilayer corpus for NLU. Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), 2018
A. Znotins, E. Cirule. NLP-PIPE: Latvian NLP Tool Pipeline. Human Language Technologies - The Baltic Perspective. Frontiers in Artificial Intelligence and Applications, vol. 307, IOS Press, 2018
N. Gruzitis, G. Nespore-Berzkalne, B. Saulite. Creation of Latvian FrameNet based on Universal Dependencies. Proceedings of the International FrameNet Workshop 2018: Multilingual FrameNets and Constructicons (IFNW), 2018
G. Nespore-Berzkalne, B. Saulite, N. Gruzitis. Latvian FrameNet: Cross-Lingual Issues. Human Language Technologies - The Baltic Perspective. Frontiers in Artificial Intelligence and Applications, vol. 307, IOS Press, 2018
P. Paikens, N. Grūzītis, L. Rituma, G. Nešpore, V. Lipskis, L. Pretkalniņa, A. Spektors. Enriching an Explanatory Dictionary with FrameNet and PropBank Corpus Examples. Proceedings of the 6th Biennial Conference on Electronic Lexicography (eLex), 2019
L. Pretkalnina, L. Rituma, B. Saulite. Deriving Enhanced Universal Dependencies from a Hybrid Dependency-Constituency Treebank. Text, Speech, and Dialogue. Lecture Notes in Computer Science, vol. 11107, Springer, 2018
N. Gruzitis, D. Gosko, G. Barzdins. RIGOTRIO at SemEval-2017 Task 9: Combining machine learning and grammar engineering for AMR parsing and generation. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval), 2017
This work is supported by the European Regional Development Fund under the grant agreements No. 1.1.1.1/16/A/219 (Full Stack of Language Resources for Natural Language Understanding and Generation in Latvian) and No. 1.1.1.2/VIAA/1/16/188 (From Abstract Meaning Representation to Natural Language Sentence and Coherent Text Generation).
The treebank layer from which the UD representation is derived has been annotated using an extended version of TrEd. The FrameNet layer as well as the named entity and coreference layers have been annotated using a customised instance of WebAnno. Draft AMR graphs have been acquired using the Hugo.lv LV-EN neural MT system and the AMREager parser.
These data sets by AiLab are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
By using these data sets, you agree to comply with the European Intellectual Property Rights and the European General Data Protection Regulation.
Please cite the relevant publications if you use this data in your research, and let us know if you use it in the development of products or services. Your citations and feedback are important to secure funding for the further development of these data sets.
Project coordinator: Normunds Grūzītis
Team members: Ilze Auziņa, Guntis Bārzdiņš, Roberts Darģis, Mikus Grasmanis, Kristīne Levāne-Petrova, Gunta Nešpore-Bērzkalne, Pēteris Paikens, Lauma Pretkalniņa, Laura Rituma, Inguna Skadiņa, Baiba Valkovska (Saulīte), Artūrs Znotiņš