# Exploring Alignment Data

This notebook shows how to load and explore the aligned Bible data available from this repository.

Aligned words are identified using a BBCCCVVVWWWP format. For example, "400010020031" refers to:

* Matthew (book 40)
* Chapter 1 (001)
* Verse 2 (002)
* Word 3 in this text's sequence (003)
* Part 1 (for Greek, this is always 1; for Hebrew, words may be segmented into multiple parts)
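
The field layout above can be unpacked with plain string slicing. The following is a minimal sketch, not part of this repository: `WordRef` and `parse_identifier` are hypothetical names introduced only for illustration. It accepts only the 12-digit form; the target identifiers shown later in this notebook have 11 digits (no part digit) and would be rejected.

```python
from typing import NamedTuple


class WordRef(NamedTuple):
    """Numeric fields of a BBCCCVVVWWWP identifier (hypothetical helper type)."""
    book: int
    chapter: int
    verse: int
    word: int
    part: int


def parse_identifier(identifier: str) -> WordRef:
    """Split a 12-digit BBCCCVVVWWWP string into its numeric fields."""
    if len(identifier) != 12 or not (identifier.isascii() and identifier.isdigit()):
        raise ValueError(f"expected 12 ASCII digits, got {identifier!r}")
    return WordRef(
        book=int(identifier[0:2]),      # BB
        chapter=int(identifier[2:5]),   # CCC
        verse=int(identifier[5:8]),     # VVV
        word=int(identifier[8:11]),     # WWW
        part=int(identifier[11]),       # P
    )


ref = parse_identifier("400010020031")
print(ref)
```

For the example identifier, this yields book 40, chapter 1, verse 2, word 3, part 1, matching the list above.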


In [1]:
import config

## Available Alignments

Alignments are grouped by language (using ISO 639-3 codes) and then by version abbreviation. Typically the Old Testament (OT) alignment is separate from the New Testament (NT) alignment.

The alignment file format is described in docs/format.md.

In [2]:
for al in sorted([f"data/{lang.name}/{version.name}" 
                  for lang in config.ALIGNMENTS.glob("*")
                  for version in lang.glob("*")]):
    print(al)

data/eng/ESV
data/eng/LEB
data/eng/NET
data/eng/YLT
data/hin/HSB
data/man/CUVMP


In [3]:
# An example of the alignment file for the NA27 Greek New Testament aligned with the English Young's Literal Translation (YLT).
!head ../data/alignments/eng/YLT/NA27-YLT-manual.json

[
{"40001001.1": {"NA27": ["400010010011"], "YLT": ["40001001002", "40001001001"], "meta": {"process": "manual"}}},
{"40001001.2": {"NA27": ["400010010021"], "YLT": ["40001001005", "40001001003"], "meta": {"process": "manual"}}},
{"40001001.3": {"NA27": ["400010010031"], "YLT": ["40001001007", "40001001006"], "meta": {"process": "manual"}}},
{"40001001.4": {"NA27": ["400010010041"], "YLT": ["40001001008"], "meta": {"process": "manual"}}},
{"40001001.5": {"NA27": ["400010010051"], "YLT": ["40001001010"], "meta": {"process": "manual"}}},
{"40001001.6": {"NA27": ["400010010061"], "YLT": ["40001001012", "40001001011"], "meta": {"process": "manual"}}},
{"40001001.7": {"NA27": ["400010010071"], "YLT": ["40001001014"], "meta": {"process": "manual"}}},
{"40001001.8": {"NA27": ["400010010081"], "YLT": ["40001001016", "40001001015"], "meta": {"process": "manual"}}},
{"40001002.1": {"NA27": ["400010020011"], "YLT": ["40001002001"], "meta": {"process": "manual"}}},
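
The records shown above can be read with the standard `json` module. This is a minimal sketch under one assumption: that the full file is a JSON array of single-key objects closed by `]` (only the first lines are visible here; docs/format.md is the authoritative description). `iter_alignments` is a hypothetical helper, and the sample string copies two records from the output above.

```python
import json

# Two records copied from the output above, wrapped as a complete JSON array.
sample = """[
{"40001001.1": {"NA27": ["400010010011"], "YLT": ["40001001002", "40001001001"], "meta": {"process": "manual"}}},
{"40001001.4": {"NA27": ["400010010041"], "YLT": ["40001001008"], "meta": {"process": "manual"}}}
]"""


def iter_alignments(records, sourceid, targetid):
    """Yield (alignment_id, source_ids, target_ids) from a list of single-key records."""
    for record in records:
        for alignment_id, body in record.items():
            yield alignment_id, body[sourceid], body[targetid]


pairs = list(iter_alignments(json.loads(sample), "NA27", "YLT"))
for alignment_id, source_ids, target_ids in pairs:
    print(alignment_id, source_ids, "->", target_ids)
```

Each record maps one alignment identifier to a list of source word identifiers and a list of target word identifiers, so a single Greek word can align with several English words.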


## Source and Target Files

Each alignment file has a corresponding source file and target file in TSV format, and each of these identifies every word by a unique identifier. Some source or target texts are copyrighted; in such cases, the surface text is omitted, but other metadata is still available.

Here's an example for the English Young's Literal Translation (YLT), aligned with the NA27 Greek New Testament. This format comes from manual alignments done by Grape City (GC).

In [4]:
# An example of the source file for the NA27 Greek New Testament.
!head ../data/sources/NA27-YLT.tsv

identifier	altId	text	strongs	gloss	gloss2	lemma	pos	morph
400010010011	--	--	G0976	a record	A record	βίβλος	noun	n- -nsf-
400010010021	--	--	G1078	of [the] genealogy	of genealogy	γένεσις	noun	n- -gsf-
400010010031	--	--	G2424	of Jesus	of Jesus	Ἰησοῦς	Name	nr -gsm-
400010010041	--	--	G5547	Christ	Christ	Χριστός	Name	nr -gsm-
400010010051	--	--	G5207	son	son	υἱός	noun	n- -gsm-
400010010061	--	--	G1138	of David	of David	Δαυίδ	Name	nr -gsm-
400010010071	--	--	G5207	son	son	υἱός	noun	n- -gsm-
400010010081	--	--	G0011	of Abraham.	of Abraham	Ἀβραάμ	Name	nr -gsm-
400010020011	--	--	G0011	Abraham	Abraham	Ἀβραάμ	Name	nr -nsm-


In [5]:
# An example of the target file for the YLT, aligned with the NA27 Greek New Testament.
!head ../data/targets/NA27-YLT.tsv

identifier	altId	text	transType	isPunc	isPrimary
40001001001	A-1	A	m	False	False
40001001002	roll-1	roll	k	False	True
40001001003	of-1	of	m	False	False
40001001004	the-1	the		False	False
40001001005	birth-1	birth	k	False	True
40001001006	of-2	of	m	False	False
40001001007	Jesus-1	Jesus	k	False	True
40001001008	Christ-1	Christ	k	False	True
40001001009	,-1	,		False	False
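
Outside the reader classes used later in this notebook, the TSV files can also be read with the standard `csv` module. This is a minimal sketch: `read_tsv` is a hypothetical helper, and the sample rows are copied from the target file output above. Quoting is disabled because the text column can contain quotation marks as ordinary tokens.

```python
import csv
import io

# Sample rows copied from the target file shown above (tab-separated).
sample = (
    "identifier\taltId\ttext\ttransType\tisPunc\tisPrimary\n"
    "40001001001\tA-1\tA\tm\tFalse\tFalse\n"
    "40001001002\troll-1\troll\tk\tFalse\tTrue\n"
    "40001001004\tthe-1\tthe\t\tFalse\tFalse\n"
)


def read_tsv(handle):
    """Return a dict mapping each identifier to its row (a dict of column -> string)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["identifier"]: row for row in reader}


words = read_tsv(io.StringIO(sample))
# Every cell is read as a string, so boolean-looking columns are compared with "True".
primary = [row["text"] for row in words.values() if row["isPrimary"] == "True"]
print(primary)
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example one opened with `encoding="utf-8"` and `newline=""`.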


## Loading Source Data into Python

This example loads the Grape City source data for NA27. Because the NA27 text is copyrighted, the text values are replaced with "--".

In [6]:
import gcsource

rd = gcsource.Reader(sourceid="NA27", targetid="YLT")
print(f"{len(rd)} words for {rd.sourceid}.")

138014 words for NA27.


In [7]:
# display the word data for Mark 4:3 (41004003*)
for identifier, word in rd.items():
    if identifier.startswith("41004003"):
        word.display()

410040030011: --                   ('listen,', 'ἀκούω', verb)
410040030021: --                   ('behold', 'ὁράω', verb)
410040030031: --                   ('went out', 'ἐξέρχομαι', verb)
410040030041: --                   ('the one', 'ὁ', det)
410040030051: --                   ('sowing', 'σπείρω', verb)
410040030061: --                   ('to sow [seed].', 'σπείρω', verb)


In [8]:
import gctarget

rd = gctarget.Reader(sourceid="NA27", targetid="YLT")
print(f"{len(rd)} words for {rd.targetid}.")

223880 words for YLT.


In [9]:
# display the word data for Mark 4:3 (41004003*)
for identifier, word in rd.items():
    if identifier.startswith("41004003"):
        word.display()

41004003001: ‘                    ('', False, False)
41004003002: Hearken              ('k', False, True)
41004003003: ,                    ('', False, False)
41004003004: lo                   ('k', False, True)
41004003005: ,                    ('', False, False)
41004003006: the                  ('k', False, True)
41004003007: sower                ('k', False, True)
41004003008: went                 ('k', False, True)
41004003009: forth                ('k', False, False)
41004003010: to                   ('m', False, False)
41004003011: sow                  ('k', False, True)
41004003012: ;                    ('', False, False)
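
The loops above scan every word to find one verse. When many verses are needed, grouping identifiers once by their first eight digits (BBCCCVVV) avoids repeated scans. This is a minimal sketch that works on any iterable of identifier strings; `index_by_verse` is a hypothetical helper and not part of `gcsource` or `gctarget`.

```python
from collections import defaultdict


def index_by_verse(identifiers):
    """Group identifiers by their BBCCCVVV prefix, preserving input order."""
    by_verse = defaultdict(list)
    for identifier in identifiers:
        by_verse[identifier[:8]].append(identifier)
    return dict(by_verse)


# Stand-in identifiers; with a reader, the keys yielded by rd.items() could be passed instead.
sample_ids = ["41004003001", "41004003002", "41004004001"]
verses = index_by_verse(sample_ids)
print(verses["41004003"])
```

Because source identifiers (12 digits) and target identifiers (11 digits) share the same eight-digit verse prefix, the same index works for both.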


More to come ...