This repository contains the documentation for Opera Graeca Adnotata (OGA), a multilayer corpus of Ancient Greek texts with automatically generated annotations (the largest one of its kind) 🏋️❤️😃:
- number of base texts: 1,999, each corresponding to an ancient work
- number of tokens: 40,105,221
The Ancient Greek works included in OGA come from the GitHub repositories:
- canonical-greekLit (release 0.0.11376425141)
- First1KGreek (release 1.1.11352615003)
- PatristicTextArchive (release 1.1.11363682704)
In general, OGA contains one edition for each work (the duplicates in canonical-greekLit and First1KGreek were discarded), with 90 exceptions coming from PatristicTextArchive and documented in urn_cts/texts/duplicates_tlg_pta.xml. The list of all texts included in OGA is in urn_cts/texts/urn_cts_plus_date_label.xml.
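Before relying on either XML file, it can help to look at which elements it actually contains. The following is a minimal sketch that only counts element tags; it makes no assumption about the real schema of the OGA files. The function name `tag_counts` and the tiny sample document (including its `texts` and `text` tags) are hypothetical and serve only as illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(source) -> Counter:
    """Count how often each element tag occurs in an XML document.

    `source` may be a file path or a binary file-like object.
    Namespaced tags appear in ElementTree's "{namespace}name" form.
    """
    tree = ET.parse(source)
    return Counter(element.tag for element in tree.iter())


# Hypothetical sample: NOT the real structure of the OGA files.
sample = io.BytesIO(b"<texts><text/><text/><text/></texts>")
counts = tag_counts(sample)
print(counts.most_common())

# Inside a clone of this repository, the same call could be tried on:
# tag_counts("urn_cts/texts/urn_cts_plus_date_label.xml")
```

Counting tags first is a cheap way to learn the shape of a file before writing code that depends on specific element or attribute names.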
Because of the large corpus size, the actual corpus data is made available on Zenodo at
The data can also be queried through the ANNIS 4 search tool at:
To learn more about querying the corpus with ANNIS 4, see the documentation and examples in the folder query. ANNIS 4 can also be run on a desktop computer: if you have enough disk space, consider using the desktop version with the data (~21 GB) made available on Zenodo.
An overview of the corpus can be read on the accompanying website at
The present repository contains a few manually annotated files used to create annotations in OGA. It is meant to be the place to expand and discuss them collaboratively (through a GitHub Issue or Discussion).
The corpus can be queried by:
- word form (i.e., token)
- lemma
- morphology (POS and morphological features)
- syntax (dependency syntax following the AGDT annotation scheme)
- CTS URN for work, author, and edition
- CTS passage for each work (e.g., "book", "section", etc.)
- author name
- work title
- alleged composition date for each work
- (experimental) IPA transcription of word forms (the 5th-century BCE Attic ones)
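The CTS URN entries in the list above map onto the parts of a CTS URN: the text group usually identifies the author, followed by the work and the version (edition), with an optional passage reference at the end. The sketch below splits a URN into these parts. It is only an illustration: the class `CtsUrn` and the function `parse_cts_urn` are hypothetical names, the exemplar level of the CTS scheme is not handled, and the exact form in which OGA stores URNs is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    namespace: str
    textgroup: str           # usually identifies the author
    work: Optional[str]
    version: Optional[str]   # edition or translation
    passage: Optional[str]   # e.g. "1.1"


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN of the form urn:cts:NAMESPACE:GROUP.WORK.VERSION[:PASSAGE]."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    work_parts = parts[3].split(".")
    if not 1 <= len(work_parts) <= 3 or not all(work_parts):
        raise ValueError(f"unsupported work component: {parts[3]!r}")
    # Pad with None so that work and version are optional.
    textgroup, work, version = work_parts + [None] * (3 - len(work_parts))
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(parts[2], textgroup, work, version, passage)


# Illustrative URN in the style used by canonical-greekLit.
urn = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1")
print(urn.textgroup, urn.work, urn.version, urn.passage)
# → tlg0012 tlg001 perseus-grc2 1.1
```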
The morphosyntactic annotation and lemmatization are the outputs of (i) a parser and (ii) a lemmatizer trained on 1.2M+ tokens of AGDT data (a much larger dataset than all UD ones combined). The parser and lemmatizer are documented in the article A State-of-the-Art Morphosyntactic Parser and Lemmatizer for Ancient Greek. Since the training dataset contained a large selection of texts of different centuries and genres, the models used for OGA are expected to generalize much better than the existing ones trained on much smaller (UD) datasets. The final scores for the models are:
POS | XPOS | Feats | ALLTags | UAS | LAS | Lemmas |
---|---|---|---|---|---|---|
96.41 | 91.90 | 94.77 | 91.56 | 82.60 | 77.10 | 91.41 |
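
For readers unfamiliar with the two syntactic scores: UAS (unlabeled attachment score) is the percentage of tokens that receive the correct head, and LAS (labeled attachment score) is the percentage that receive both the correct head and the correct dependency label. The sketch below computes both on a made-up four-token sentence. The function name `attachment_scores` and the toy data are hypothetical, and the evaluation details of the article (for example, how punctuation is treated) are not reproduced here.

```python
from typing import Sequence, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency label)


def attachment_scores(gold: Sequence[Arc], pred: Sequence[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for two aligned arc sequences."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    # UAS counts correct heads; LAS additionally requires the correct label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * full_hits / total


# Toy data with AGDT-style labels; not taken from OGA.
gold = [(2, "ATR"), (0, "PRED"), (2, "OBJ"), (3, "ATR")]
pred = [(2, "ATR"), (0, "PRED"), (2, "SBJ"), (2, "ATR")]

uas, las = attachment_scores(gold, pred)
print(uas, las)  # → 75.0 50.0
```

Since every arc counted for LAS also counts for UAS, LAS can never exceed UAS, which matches the scores reported above (82.60 vs. 77.10).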
The annotations for CTS URN, CTS passage, author name, and work title were retrieved automatically from the original texts (and therefore they may contain errors and inconsistencies).
The annotation for work composition dates was done manually and is a work in progress (see chronology_greek_works.xml).
The IPA transcription is based on a ByT5 model that achieved an accuracy of 0.83 (correct IPA transcriptions) on Greek and Latin data from Wiktionary. The IPA transcription is the 5th-century BCE Attic one (see, for example, ἄξιοι).
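ByT5 models operate on UTF-8 bytes rather than on a subword vocabulary, so a polytonic Greek word form reaches the model as a byte sequence. The sketch below shows what that sequence looks like for ἄξιοι and how its length depends on Unicode normalization. Which normalization form the OGA pipeline uses is not stated here, so comparing NFC and NFD is purely illustrative; the helper `utf8_profile` is a hypothetical name.

```python
import unicodedata
from typing import Tuple

# ἄξιοι, written with escapes so the code points are unambiguous.
word = "\u1f04\u03be\u03b9\u03bf\u03b9"


def utf8_profile(text: str, form: str) -> Tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) after normalization."""
    normalized = unicodedata.normalize(form, text)
    return len(normalized), len(normalized.encode("utf-8"))


nfc = utf8_profile(word, "NFC")
nfd = utf8_profile(word, "NFD")
print(nfc)  # → (5, 11)
print(nfd)  # → (7, 14)
```

In NFD the precomposed ἄ is split into α plus two combining marks, which is why the same word is three bytes longer.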
The present repository is organized as follows (further details within each folder):
- abbreviations: contains a file useful for tokenization.
- annotation_example: contains an unzipped example of the annotation layers, which is useful for inspection.
- elision: contains a file used for tokenization.
- query: contains information to query the corpus.
- tokenize: contains files used for tokenization.
- urn_cts: contains files with bibliographic information about the texts in OGA.
- work_chronology: contains a manually annotated file with the alleged composition dates of Greek works, which continues to be updated.
If you use OGA v0.2.0 or material within this repository, please cite it as follows (the article describes an earlier version of OGA, v0.1.0, but is still relevant):
Giuseppe G. A. Celano. Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek. arXiv https://arxiv.org/abs/2404.00739.
@misc{celano2024operagraecaadnotatabuilding,
title= {Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek},
author= {Giuseppe G. A. Celano},
year= {2024},
eprint= {2404.00739},
archivePrefix= {arXiv},
primaryClass= {cs.CL},
url= {https://arxiv.org/abs/2404.00739},
}
To cite the corpus directly:
Giuseppe G. A. Celano. 2024. Opera Graeca Adnotata (v0.2.0). Zenodo.
https://doi.org/10.5281/zenodo.14206061
@misc{celanoOGA020,
author = {Giuseppe G. A. Celano},
title = {Opera Graeca Adnotata},
year = {2024},
publisher = {Zenodo},
version = {v0.2.0},
doi = {10.5281/zenodo.14206061},
url = {https://doi.org/10.5281/zenodo.14206061}
}
Dr. Giuseppe G. A. Celano
Universität Leipzig
Institute of Computer Science, NLP
Augustusplatz 10
04109 Leipzig
Deutschland
mysurname at informatik.uni-leipzig.de
(Project number: 408121292)
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (for more details, see also the repositories of the original texts mentioned above).