The PluG (Pluperfect GRAC) corpus is a collection of Ukrainian texts from the General Regionally Annotated Corpus of Ukrainian (GRAC: uacorpus.org). It covers texts from 1816 to 1954, including various types such as fiction, news articles, and other writings. The corpus focuses on works from before the mid-20th century and contains texts by 7,590 unique authors and 44 unique translators.
The corpus comprises 42,000 files with 58,676,313 tokens (109M Gemma tokens). It consists of copyright-free classic literature and other old texts suitable for LLM training, computational linguistics studies, and education. The texts were extracted from printed sources using OCR and corrected manually. The corpus includes some texts written in old orthographic systems (Kulishivka, Zhelekhivka, Skrypnykivka). The texts come from various regions of Ukraine, with many from cities such as Kyiv, Lviv, and Kharkiv. PluG includes both original Ukrainian works and translations from other languages.
PluG2 is an expanded version of the PluG corpus that contains a larger collection of Western Ukrainian texts from the 1880s to the 1920s written using the orthography system of the time (Zhelekhivka). PluG2 features 73,900,596 tokens. The added texts represent not only a distinctive orthographic system, but also a separate historical variant of literary Ukrainian, which has numerous peculiar grammatical and lexical features and can cause complications when training models oriented to the modern standard.
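The size figures quoted above can be related to each other with a few lines of arithmetic. The sketch below is illustrative only: it assumes that the 73,900,596 figure is the total size of PluG2 (including the original PluG texts), that "109M" is a rounded count, and that 42,000 is the file count as stated. The constant and function names are hypothetical and are not part of the corpus distribution.

```python
# Figures quoted in the corpus description.
PLUG_TOKENS = 58_676_313          # PluG corpus tokens
PLUG_FILES = 42_000               # file count as stated (possibly rounded)
PLUG_GEMMA_TOKENS = 109_000_000   # "109M", a rounded figure
PLUG2_TOKENS = 73_900_596         # assumed to be the PluG2 total


def corpus_stats() -> dict:
    """Derive a few summary ratios from the quoted figures."""
    added = PLUG2_TOKENS - PLUG_TOKENS
    return {
        # Average document length in corpus tokens.
        "avg_tokens_per_file": PLUG_TOKENS / PLUG_FILES,
        # Roughly how many Gemma tokens correspond to one corpus token.
        "gemma_per_corpus_token": PLUG_GEMMA_TOKENS / PLUG_TOKENS,
        # Tokens added in PluG2, under the "total size" assumption.
        "plug2_added_tokens": added,
        "plug2_growth_pct": 100 * added / PLUG_TOKENS,
    }


if __name__ == "__main__":
    for name, value in corpus_stats().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions, the Gemma-to-corpus token ratio gives a rough sense of how much longer the texts become under a subword tokenizer, which may be useful when budgeting for LLM training.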
The corpus is available under a CC-BY license. It is designed as a dataset for applied linguistic studies and serves as a resource for research on Ukrainian literature, language development, and the cultural history of the 19th and early 20th centuries. Each text is accompanied by metadata, including information about authors, translators, years of publication, genres, styles, and locations.
The full tagset used in the meta-annotation is available on the GRAC website: https://uacorpus.org/rozmitka-tekstiv/stili-tematika-i-zhanri
The corpus is planned to be updated yearly so that it stays current for researchers.
Please cite PluG:
Maria Shvedova, Arsenii Lukashevskyi (2024): PluG: Corpus of Old Ukrainian Texts. Electronic resource: Kharkiv, Jena. Available at https://github.com/Dandelliony/pluperfect_grac


