A multilingual corpus of aligned Biblical and Qur’anic texts — spanning medieval and modern languages — designed for computational and philological alignment tasks.
It serves as a benchmark dataset for multilingual alignment of historical religious texts.
📢 Important Notice on Licensing
This repository includes only alignment metadata. Some source texts (especially medieval Bible translations and parts of the Qur’anic corpus) are not redistributed due to licensing restrictions.
Please consult the documentation for access or citation of original sources.
A multilingual dataset of aligned Biblical and Qur’anic texts, primarily in medieval languages, gathered from diverse external sources (see the 📂 Data Sources section). Selected modern editions are also included to enhance linguistic diversity and improve the robustness and generalizability of sentence alignment models.
The dataset is designed to support training and evaluation for historical, philological, and comparative linguistic applications.
This dataset provides training data for multilingual alignment models. It includes over 48,000 aligned verses and more than 4 million verse-level pairs, covering 29 versions in 9 languages. The corpus spans both medieval and modern textual traditions.
It is intended as an open and extensible training resource for multilingual NLP — not a fixed benchmark. Future releases may include additional sources or metadata.
📌 Each aligned verse includes two or more versions. Pair counts reflect all C(n, 2) version pairs per verse, where n is the number of versions available for that verse.
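As a rough illustration of how pair counts follow from verse records, the sketch below enumerates the unordered pairs for a single verse. The record is hypothetical: the version IDs and placeholder texts are assumptions made for the example, and only the `book`, `ref`, and `data` fields follow the structure described in this README.

```python
from itertools import combinations
from math import comb

# Hypothetical aligned verse; version IDs and texts are placeholders.
verse = {
    "book": "genesis",
    "ref": "1:1",
    "data": {
        "la_vulgata": "(Latin text)",
        "fr_lsegond": "(French text)",
        "pt_almeida": "(Portuguese text)",
        "en_wycliffe": None,  # version missing for this verse
    },
}

def version_pairs(verse):
    """Return every unordered pair of version IDs that has text for this verse."""
    present = sorted(vid for vid, text in verse["data"].items() if text)
    return list(combinations(present, 2))

pairs = version_pairs(verse)
n_present = sum(1 for text in verse["data"].values() if text)

# The number of pairs equals "n choose 2" over the versions actually present.
assert len(pairs) == comb(n_present, 2)
print(len(pairs), pairs)
```

Summing this count over all verses would give a corpus-level pair total under the same assumption that every pair of available versions counts once.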
This dataset is designed to support the development and evaluation of multilingual alignment models tailored to historical texts — a domain often underserved by modern NLP resources.
Unlike standard parallel corpora, this dataset addresses challenges specific to historical-language contexts, including:
- Structural divergence across religious and textual traditions
- Flexible or free word order in premodern languages
- Non-standardized orthographies and inconsistent editorial practices
- Gaps, mismatches, and overlaps in verse segmentation across versions
By providing aligned data across a wide range of languages and periods, the corpus aims to:
- Enable robust training of alignment systems for historical and philological contexts
- Support research on translation shifts and textual transmission across traditions
- Serve as a flexible foundation for extended corpus-building, enrichment, or annotation

➡️ See docs/verse_alignment_guidelines.md for detailed alignment criteria and pairing logic.

- NLP researchers tackling low-resource or historical alignment
- Digital humanists studying multilingual translation or textual variants
- Scholars exploring transmission across religious and linguistic boundaries
| Feature | 📖 Biblia Corpus | 🕋 Qur’anic Corpus | 
|---|---|---|
| Text Types | Biblical texts (medieval and selected modern editions) | Qur’anic translations in historical European languages and Arabic | 
| Languages | Latin, French, English, Castilian, Catalan, Italian, Portuguese, Greek | Arabic, Latin, English, French, Italian | 
| Alignment Unit | Verse-level (approximating sentence or clause) | Verse-level (based on surah:ayah structure) | 
| Format | JSON | JSON | 
| Use Case | Training multilingual alignment models (not for textual criticism) | Training multilingual alignment models (not for religious or exegetical use) | 
| Aligned Verses | 42,562 | 6,236 | 
| Aligned Pairs | 3,927,811 | 114,226 | 
The Biblical and Qur’anic texts were selected for their structural compatibility — namely, their verse-based (or surah:ayah in the case of the Qur’an) organization — and their widespread cross-linguistic transmission, which enables meaningful alignment across centuries and traditions.
| Language | Text | Source | Format | 
|---|---|---|---|
| en | John Wycliffe Bible | GitHub | .txt | 
| en | Coverdale Bible | GitHub | .xml | 
| en | Great Bible | EDGeS Corpus | .tsv | 
| it | Gospel of St. Matthew | Caterina Menichetti Edition | .pdf | 
| fr | La Bible historiale | Project site | .xml | 
| fr | Esther, Judith, Ruth | Texts kindly provided by Claudio Lagomarsini | Word* | 
| fr | Gospel of Matthew | Transcription kindly provided by Seth Middleton | .txt* | 
| gr | Septuagint (LXX) | Corpus Corporum | .xml | 
| es | Three Medieval Bibles | Proyecto Biblia Medieval | .txt | 
| ca | Three Medieval Bibles | Texts kindly provided by Pere Casanellas (Corpus Biblicum Catalanicum) | .xml, Word* | 
| la | Vulgata Sixto-Clementina | GitLab | .xml | 
* These texts are not publicly shareable due to copyright restrictions.
We gratefully acknowledge the following scholars and institutions for their contributions of source material or expertise:
- Pere Casanellas (Corpus Biblicum Catalanicum) – Catalan biblical texts based on the Egerton, Peiresc, and Colbert manuscripts
- Claudio Lagomarsini – Provided French texts of Esther, Judith, and Ruth (Bible du XIIIe siècle)
- Mouhamadoul-Khaly Wélé – Multilingual aligned dataset based on the Quran
- Seth Middleton – French transcription of the Gospel of Matthew (Bible du XIIIe siècle)
➡️ See docs/biblical_alignment_challenges.md for notes on structural complexity, exclusions, and philological variation.
Nine Bibles in French, English, Portuguese, Greek, and Spanish from this repository, used to augment language diversity.
➡️ For preprocessing and integration steps, see docs/alignment_workflow.md.
Multilingual alignment produced by the Coran 12-21 project — co-directed by Mouhamadoul-Khaly Wélé and Tristan Vigliano — covering 7 languages (Arabic, Latin, English, French, Italian, etc.), with texts kindly provided by Mouhamadoul-Khaly Wélé. Note: This resource is not publicly redistributable.
The dataset is stored in structured JSON files:
- Monolingual format: dictionary of {book → list of {ref, text}}
- Multilingual format: list of aligned verses, each with book, ref, and a data map of translations
➡️ See docs/data_structure.md for full examples and schema.
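To make the relationship between the two formats concrete, here is a minimal sketch that groups monolingual records into multilingual ones by joining on book and ref. It is not the project's alignment workflow (see docs/alignment_workflow.md for that): a naive join ignores segmentation mismatches between versions. The sample data, the `merge_versions` helper, and the placeholder texts are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical monolingual data, one entry per version:
# {book: [{"ref": ..., "text": ...}, ...]}
monolingual = {
    "fr_lsegond": {
        "genesis": [
            {"ref": "1:1", "text": "(French text of 1:1)"},
            {"ref": "1:2", "text": "(French text of 1:2)"},
        ]
    },
    "pt_almeida": {
        "genesis": [
            {"ref": "1:1", "text": "(Portuguese text of 1:1)"},
        ]
    },
}

def merge_versions(monolingual):
    """Group verses that share (book, ref) into multilingual records."""
    grouped = defaultdict(dict)
    for version_id, books in monolingual.items():
        for book, verses in books.items():
            for verse in verses:
                grouped[(book, verse["ref"])][version_id] = verse["text"]
    # Keep only verses attested in at least two versions.
    return [
        {"book": book, "ref": ref, "data": data}
        for (book, ref), data in grouped.items()
        if len(data) >= 2
    ]

aligned = merge_versions(monolingual)
print(aligned)
```

In this sample, verse 1:2 is dropped because only one version contains it, which mirrors the statement that each aligned verse includes two or more versions.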
The snippet below illustrates how to explore aligned verse pairs in the JSON file.
Each verse contains a book, ref, and a data dictionary mapping version IDs to verse translations.
```python
import json

# Load the multilingual file: a list of aligned verse records
with open("aligned_data.json", encoding="utf-8") as f:
    data = json.load(f)

# Display all aligned French–Portuguese verse pairs from Genesis
# 💡 Change language IDs below based on your alignment interest
for verse in data:
    if verse["book"] == "genesis":
        # .get() returns None when a version lacks this verse
        fr = verse["data"].get("fr_lsegond")
        pt = verse["data"].get("pt_almeida")
        if fr and pt:
            print(f'{verse["ref"]}:\nFR: {fr}\nPT: {pt}\n')
```

Note: aligned_data.json is not distributed due to licensing. Use this example to preview the data format and structure.
This corpus is an initial foundation intended to grow. Several improvements are planned to enhance its usability, accuracy, and scholarly value:
- Prioritize the collection and structuring of additional medieval texts, especially from Romance-language traditions, to rebalance the dataset (currently skewed toward modern sources) and progressively specialize it for historical modeling. This will improve alignment robustness for premodern domains and enable the training of models tailored to medieval textual data.
- Incorporate OCR/HTR of manuscript texts: Leveraging Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) will allow the inclusion of otherwise inaccessible sources, especially for underrepresented medieval texts not available in digital editions.
- Annotate editorial provenance and textual lineage: Metadata will be enriched to reflect the textual origin (e.g., manuscript family, editor, edition), enabling philological and stemmatic analysis across traditions.
- Develop a queryable interface or API: To support broader reuse and exploration, a queryable interface or lightweight CLI tool is under consideration. It would let users search and extract aligned verses by book, chapter, language pair, or version without loading the full dataset into memory.

- Current version: v0.1
- Next planned update: Continued cleaning and integration of additional medieval texts already sourced — targeted for Q4 2025
This repository is part of a broader ecosystem of tools and corpora developed for the study of medieval multilingual textual traditions:
- Aquilign: A clause-level multilingual alignment engine based on contextual embeddings (LaBSE), designed specifically for premodern texts.
- Multilingual Segmentation Data: Source texts and segmented versions in multiple medieval Romance languages, as well as Latin and English, used for training and evaluating clause segmentation models.
- Lancelot par maints langages: A parallel corpus of translations of the Lancelot en prose in medieval French, Castilian, and Italian, segmented and aligned using the Aquilign pipeline.
- Multilingual Aegidius: A parallel corpus of translations of Aegidius Romanus’ De regimine principum in Latin, medieval Romance languages, and English, processed using the same segmentation and alignment workflow.

- 🧱 Data Structure Schema ➡️ docs/data_structure.md
- ⚙️ Alignment Workflow ➡️ docs/alignment_workflow.md
- 📐 Alignment Guidelines ➡️ docs/verse_alignment_guidelines.md
- 🧩 Structural Variation and Exclusions ➡️ docs/biblical_alignment_challenges.md
- 📈 Dataset Statistics Summary ➡️ docs/dataset_statistics.md

This work benefited from national funding managed by the Agence Nationale de la Recherche under the Investissements d'avenir programme with the reference ANR-21-ESRE-0005 (Biblissima+).
The alignment metadata produced for this dataset are distributed under the CC BY-NC-SA 4.0 license, unless otherwise noted.
⚠️ Some source texts—particularly certain medieval Bible translations and portions of the Qur’anic corpus—are not included in this repository due to third-party copyright restrictions.
Please refer to the documentation or the original editions for licensing information on specific versions.

