Suggested Citation for this release: Maxim Romanov, and Masoumeh Seydi. 2019. “OpenITI: A Machine-Readable Corpus of Islamicate Texts”. Zenodo. doi:10.5281/zenodo.3082464.
* Note on release numbering: in version 2019.1.1, 2019 is the year of the release, the first dotted number (.1) is the ordinal release number within 2019, and the second dotted number (.1) is the overall release number. The first dotted number resets every year, while the second keeps increasing across years.
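The numbering scheme can be illustrated with a minimal sketch that splits a release string into its three parts. The names `ReleaseVersion` and `parse_release` are hypothetical and are not part of any OpenITI tooling.

```python
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    year: int             # year of the release
    release_in_year: int  # ordinal release number within that year (resets yearly)
    overall_release: int  # overall release number (keeps increasing)


def parse_release(version: str) -> ReleaseVersion:
    """Split a 'YEAR.N.M' release string into its three numeric parts."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected YEAR.N.M, got {version!r}")
    year, in_year, overall = (int(part) for part in parts)
    return ReleaseVersion(year, in_year, overall)


print(parse_release("2019.1.1"))
```

Because the fields are ordered year first, tuples of this kind sort chronologically, which may be convenient when comparing releases.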
Co-PIs: Matthew Thomas Miller (University of Maryland, College Park), Maxim G. Romanov (University of Vienna), Sarah Bowen Savant (Aga Khan University—ISMC, London).
Open Islamicate Texts Initiative (OpenITI, see https://openiti.github.io/) is a multi-institutional effort to construct the first machine-actionable scholarly corpus of premodern Islamicate texts. Led by researchers at the Aga Khan University (AKU), University of Vienna (UV), Leipzig University (LU), and the Roshan Institute for Persian Studies at the University of Maryland (College Park) and an interdisciplinary advisory board of leading digital humanists and Islamic, Persian, and Arabic studies scholars, OpenITI aims to provide the essential textual infrastructure in Arabic, Persian and other Islamicate languages for new forms of textual analysis and digital scholarship. In the process, OpenITI will enable new synergies between Digital Humanities and the inter-related Islamicate fields of Islamic, Persian, and Arabic Studies. In addition to support from the researchers’ home institutions, it is supported by funding from the European Research Council and the Qatar National Library.
Currently, OpenITI contains almost exclusively Arabic texts, which were first assembled into a corpus within the OpenArabic project, developed first at Tufts University (at The Perseus Project, 2013–2015) and then at Leipzig University (at the Alexander von Humboldt Chair for Digital Humanities, 2015–2017), in both cases with the support and under the patronage of Prof. Gregory Crane. A much smaller number of Persian texts was compiled during 2015–2016 in the Persian Digital Library (PDL) pilot (see: https://persdigumd.github.io/PDL/) at the Roshan Institute for Persian Studies at the University of Maryland. These texts have not yet been made fully compatible with OpenITI mARkdown and will be made fully available in future releases.
Machine-readable metadata on the corpus is available in the "OpenITI_metadata_2019_1_1" file in TSV format, including the following columns:
- versionUri : a human-readable URI of the current version of the book, which includes the date of death of the author, the author’s name, the book name, the ID of the original source (the online library from which the text comes), and the language domain (e.g., -ara1 for Arabic, -per1 for Persian). URIs were deliberately made human-readable to make it easier to work with the corpus. Examples of URIs and a description of the URI formation principles can be found at https://maximromanov.github.io/OpenITI/#cts-compliant-naming-pattern.
- date : date of death of the author (hijrī, from versionUri)
- author : name of the author (from versionUri)
- book : book title (from versionUri)
- id : book ID, which originally comes from the collection the source is taken from (from versionUri)
- status : pri/sec; the “pri” value can be used to select only unique titles from the corpus (i.e., excluding additional versions of the same title)
- length : length in words
- url : link to the text on GitHub
- tags : genre/subject tags aggregated from the metadata of the original collections (partially unified)
- localPath : local path to the text in the OpenITI folder structure
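The columns above can be used, for example, to restrict an analysis to unique titles. The following is a minimal sketch that assumes the TSV file has a header row with exactly these column names and that `length` holds a plain integer. The sample rows and the helper names (`load_metadata`, `primary_only`, `total_words`) are made up for illustration.

```python
import csv
import io


def load_metadata(handle):
    """Read tab-separated metadata into a list of dicts keyed by column name."""
    # QUOTE_NONE: treat quotation marks in titles or tags as ordinary characters.
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def primary_only(rows):
    """Keep only rows marked 'pri', i.e. one version per title."""
    return [row for row in rows if row["status"] == "pri"]


def total_words(rows):
    """Sum the 'length' column (length in words)."""
    return sum(int(row["length"]) for row in rows)


# Made-up sample with a subset of the columns; real URIs and values will differ.
SAMPLE = (
    "versionUri\tstatus\tlength\n"
    "0001AuthorA.BookA.SourceA001-ara1\tpri\t1200\n"
    "0001AuthorA.BookA.SourceB002-ara1\tsec\t1150\n"
    "0002AuthorB.BookB.SourceA003-ara1\tpri\t800\n"
)

rows = load_metadata(io.StringIO(SAMPLE))
unique_titles = primary_only(rows)
print(len(rows), len(unique_titles), total_words(unique_titles))
```

To run this against the release, the `io.StringIO(SAMPLE)` handle would be replaced by the opened metadata file, for instance `open("OpenITI_metadata_2019_1_1", encoding="utf-8", newline="")`, adjusting the path and any file extension to the local copy.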
The goal of OpenITI is to build a machine-actionable corpus of premodern texts in Islamicate languages to encourage computational analysis of the Islamicate written tradition. Most of the Arabic texts have been collected from open-access online collections of premodern and modern Arabic texts such as http://shamela.ws/ and http://shiaonlinelibrary.com/ (these texts have IDs such as Shia+NUMBER); some texts come from al-Jāmiʿ al-kabīr, which has been published on an external HDD and is not available online (IDs such as JK+NUMBER). Initial metadata from these collections is preserved at the beginning of each file. (The next release will include a number of Persian texts, which come primarily from the Ganjoor digital library, https://ganjoor.net/.)
Currently uploaded texts have been automatically converted into the OpenITI mARkdown format, a flavor of markdown developed for tagging premodern Islamicate texts. All of our texts require further editing to properly tag their structure. A detailed description of the mARkdown scheme and the tagging workflow can be found in the OpenITI mARkdown section (https://maximromanov.github.io/mARkdown/). When manual tagging is complete, the texts will be converted into a CTS-compliant XML format.
The current version includes the following file extensions:
- [no extension]: This is a RAW file, automatically converted from its initial format to be as close to the OpenITI mARkdown format as possible. NB: Since the corpus is a work in progress and many texts have not yet been manually edited, tags that may appear in texts do not necessarily correspond to the proper OpenITI mARkdown scheme!
- *.completed : The conversion of the file is completed, but the file still requires final verification and vetting.
- *.mARkdown : The file has been verified and vetted.
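As a rough illustration, the three extensions above can be mapped to an editing stage. The function name `editing_stage` and the stage labels are hypothetical. The check uses the end of the file name rather than splitting on the last dot, on the assumption that the identifiers used as file names may themselves contain dots.

```python
def editing_stage(filename: str) -> str:
    """Classify a corpus file by its extension, following the list above."""
    if filename.endswith(".mARkdown"):
        return "vetted"      # verified and vetted
    if filename.endswith(".completed"):
        return "completed"   # conversion done, final vetting still pending
    return "raw"             # no extension: automatically converted only


# Made-up file names for illustration.
for name in (
    "0001AuthorA.BookA.SourceA001-ara1",
    "0001AuthorA.BookA.SourceA001-ara1.completed",
    "0001AuthorA.BookA.SourceA001-ara1.mARkdown",
):
    print(name, "->", editing_stage(name))
```

The comparison is case-sensitive, so a file ending in `.markdown` would be classified as raw under this sketch.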
In the long run we envision that the entire corpus will be converted into TEI XML and made available to a wider public as a digital library.
Statistics on the corpus
| Measure | Value |
|---|---|
| Number of titles (with all versions/editions) | 7,144 |
| Number of unique titles | 4,288 |
| Number of authors | 1,859 |
| Length in words (all) | 1,520,667,360 |
| Length in pages (300 w/p) | 5,068,891 |
| Length in words (unique) | 755,689,541 |
| Length in pages (unique; 300 w/p) | 2,518,965 |
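The page figures follow from the word counts at 300 words per page. A short sketch of that arithmetic is given here; rounding to the nearest whole page is an assumption, and `estimated_pages` is a hypothetical helper name.

```python
WORDS_PER_PAGE = 300  # the "300 w/p" convention used in the statistics


def estimated_pages(words: int) -> int:
    """Convert a word count to an approximate page count (rounding is assumed)."""
    return round(words / WORDS_PER_PAGE)


for label, words in (("all", 1_520_667_360), ("unique", 755_689_541)):
    print(f"{label}: {estimated_pages(words):,} pages")
```

With the word counts reported above, this reproduces the page counts in the statistics: 5,068,891 for all texts and 2,518,965 for unique titles.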
[Figure: Lengths of texts, in words and pages (300 w/p)]
[Figure: Chronological distribution of texts]
For more information on OpenITI, see https://maximromanov.github.io/OpenITI/.
Link to Zenodo: https://zenodo.org/record/3082464
- OpenITI (main folder): main data folder containing subfolders (Author > Book > Versions)
- README.md: release notes
- OpenITI_metadata_2019_1_1: metadata file