# Exploring Alignment Data

This notebook shows how to load and explore the aligned Bible data available from this repository.

Aligned words are identified using a BBCCCVVVWWWP format: for example, "400010020031" refers to

* Matthew (book 40)
* Chapter 1 (001)
* Verse 2 (002)
* Word 3 in this text's sequence (003)
* Part 1 (for Greek, this is always 1; for Hebrew, words may be segmented into multiple parts)
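
As a minimal sketch of how such an identifier breaks down, the helper below slices a 12-digit string into its fields. The names `BCVWP` and `parse_bcvwp` are hypothetical and are not part of `bible_alignments`. The sketch also assumes the identifier has no leading letter, so a Macula-style source ID such as `n40001001001` would need its prefix handled separately.

```python
from typing import NamedTuple


class BCVWP(NamedTuple):
    """Numeric fields of a BBCCCVVVWWWP identifier (hypothetical helper type)."""
    book: int
    chapter: int
    verse: int
    word: int
    part: int


def parse_bcvwp(identifier: str) -> BCVWP:
    """Split a 12-digit BBCCCVVVWWWP string into its numeric fields."""
    if len(identifier) != 12 or not (identifier.isascii() and identifier.isdigit()):
        raise ValueError(f"expected 12 ASCII digits, got {identifier!r}")
    return BCVWP(
        book=int(identifier[0:2]),      # BB
        chapter=int(identifier[2:5]),   # CCC
        verse=int(identifier[5:8]),     # VVV
        word=int(identifier[8:11]),     # WWW
        part=int(identifier[11]),       # P
    )


print(parse_bcvwp("400010020031"))
```

Because the fields are fixed-width, a verse prefix such as `"41004003"` can also be matched with `str.startswith`, which is the approach used in later cells.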


In [1]:
from bible_alignments import ROOT, DATAPATH, ALIGNMENTS, SOURCES, TARGETS


## Available Alignments

Alignments are grouped by language (using ISO 639-3 codes) and then by version abbreviation. Typically the OT alignment is separate from the NT alignment.

The alignment file format is described in docs/format.md.

Here's a list of the currently available alignments.

In [2]:
alignments = sorted([f"data/{lang.name}/{version.name}" 
                     for lang in ALIGNMENTS.glob("*") 
                     for version in lang.glob("[A-Z]*")])
for al in alignments:
    print(al)

data/arb/AVD
data/eng/BSB
data/eng/YLT
data/hin/HSB


Here's an example of the raw alignment file for the SBLGNT aligned with the Berean Standard Bible, in Scripture Burrito format.

In [3]:
!head -20 ../../data/alignments/eng/BSB/SBLGNT-BSB-manual.json

{
 "documents": [
    {"docid": "SBLGNT", "scheme": "BCVWP"},
    {"docid": "BSB", "scheme": "BCVWP"}
 ],
 "meta": {"conformsTo": "0.3", "creator": "Berean Bible"},
 "roles": ["source", "target"],
 "type": "translation",
 "records": [
 {"source": ["n40001001001"], "target": ["400010010011", "400010010021", "400010010031", "400010010041"], "meta": {"id": "40001001.001", "process": "manual"}},
 {"source": ["n40001001002"], "target": ["400010010051", "400010010061", "400010010071"], "meta": {"id": "40001001.002", "process": "manual"}},
 {"source": ["n40001001003"], "target": ["400010010081", "400010010091"], "meta": {"id": "40001001.003", "process": "manual"}},
 {"source": ["n40001001004"], "target": ["400010010101"], "meta": {"id": "40001001.004", "process": "manual"}},
 {"source": ["n40001001005"], "target": ["400010010121", "400010010131"], "meta": {"id": "40001001.005", "process": "manual"}},
 {"source": ["n40001001006"], "target": ["400010010141", "400010010151"], "meta": {"id": "400
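
The record structure shown above can also be read with the standard `json` module. The sketch below is an illustration only: it parses a small inline sample in the same shape as the excerpt (two of its records) rather than the real file, and `source_to_targets` is a hypothetical name.

```python
import json

# Inline sample in the same shape as the excerpt above (two records only).
sample = """
{
 "documents": [
    {"docid": "SBLGNT", "scheme": "BCVWP"},
    {"docid": "BSB", "scheme": "BCVWP"}
 ],
 "meta": {"conformsTo": "0.3", "creator": "Berean Bible"},
 "roles": ["source", "target"],
 "type": "translation",
 "records": [
  {"source": ["n40001001001"], "target": ["400010010011", "400010010021", "400010010031", "400010010041"], "meta": {"id": "40001001.001", "process": "manual"}},
  {"source": ["n40001001004"], "target": ["400010010101"], "meta": {"id": "40001001.004", "process": "manual"}}
 ]
}
"""

data = json.loads(sample)

# Map each source token ID to the target token IDs it is aligned with.
source_to_targets = {}
for record in data["records"]:
    for source_id in record["source"]:
        source_to_targets.setdefault(source_id, []).extend(record["target"])

for source_id, target_ids in source_to_targets.items():
    print(source_id, "->", target_ids)
```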

## Source and Target Files

For each alignment file, there is a corresponding source and target file in TSV format that identifies each word by a unique identifier. Some source or target texts are copyrighted: in such cases, the surface text is omitted, but other metadata is still available.

Here's an example of the raw data for the SBLGNT, followed by the Berean Standard Bible. 

In [4]:
# The source file for the SBL Greek New Testament (SBLGNT).
# 
!head ../../data/sources/SBLGNT.tsv

id	altId	text	strongs	gloss	gloss2	lemma	pos	morph
n40001001001	Βίβλος-1	Βίβλος	G0976	[The] book	book	βίβλος	noun	N-NSF
n40001001002	γενέσεως-1	γενέσεως	G1078	of [the] genealogy	genealogy	γένεσις	noun	N-GSF
n40001001003	Ἰησοῦ-1	Ἰησοῦ	G2424	of Jesus	Jesus	Ἰησοῦς	noun	N-GSM
n40001001004	χριστοῦ-1	χριστοῦ	G5547	Christ	Christ	Χριστός	noun	N-GSM
n40001001005	υἱοῦ-1	υἱοῦ	G5207	son	son	υἱός	noun	N-GSM
n40001001006	Δαυὶδ-1	Δαυὶδ	G1138	of David	David	Δαυίδ	noun	N-PRI
n40001001007	υἱοῦ-2	υἱοῦ	G5207	son	son	υἱός	noun	N-GSM
n40001001008	Ἀβραάμ-1	Ἀβραάμ	G0011	of Abraham	Abraham	Ἀβραάμ	noun	N-PRI
n40001002001	Ἀβραὰμ-1	Ἀβραὰμ	G0011	Abraham	Abraham	Ἀβραάμ	noun	N-PRI


In [5]:
# The target file for the Berean Standard Bible New Testament
# 
!head ../../data/targets/eng/nt_BSB.tsv

id	altId	text	transType	isPunc	isPrimary
400010010011	This-1	This		False	True
400010010021	is-1	is		False	True
400010010031	the-1	the		False	True
400010010041	record-1	record		False	True
400010010051	of-1	of		False	True
400010010061	the-1	the		False	True
400010010071	genealogy-1	genealogy		False	True
400010010081	of-2	of		False	True
400010010091	Jesus-1	Jesus		False	True
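
For readers who want to inspect these TSV files without the library's readers, here is a minimal sketch using the standard `csv` module. It works on a few rows copied from the output above rather than the real file, and it assumes the boolean columns are stored as the literal text `True` or `False`, as they appear there.

```python
import csv
import io

# A few rows copied from the target file excerpt above.
rows = [
    ["id", "altId", "text", "transType", "isPunc", "isPrimary"],
    ["400010010011", "This-1", "This", "", "False", "True"],
    ["400010010021", "is-1", "is", "", "False", "True"],
    ["400010010031", "the-1", "the", "", "False", "True"],
    ["400010010041", "record-1", "record", "", "False", "True"],
]
sample_tsv = "\n".join("\t".join(row) for row in rows)

# QUOTE_NONE keeps any quotation marks in the text column as literal characters.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
tokens = {row["id"]: row for row in reader}

# Booleans arrive as strings, so compare against the text "True".
primary_words = [tok["text"] for tok in tokens.values() if tok["isPrimary"] == "True"]
print(" ".join(primary_words))
```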


## Loading Source Data

This example loads the source data for SBLGNT and displays some attributes of that data. 

In [6]:
from pprint import pprint
from bible_alignments.burrito import SourceReader

src = SourceReader(DATAPATH / "sources/SBLGNT.tsv")

print(f"Number of tokens: {len(src)}")
print(f"Number of distinct tokens: {len(src.vocabulary())}")
print(f"Number of distinct lemmas: {len(src.vocabulary(tokenattr='lemma'))}")

Number of tokens: 137741
Number of distinct tokens: 19355
Number of distinct lemmas: 5468


In [7]:
# dict: token ID -> Source() instance
# Show data for the first word from MRK 4:3. Note the leading 'n' indicating New Testament, a Macula convention.
pprint(src["n41004003001"].asdict())


{'altId': 'Ἀκούετε-1',
 'gloss': 'Listen',
 'gloss2': 'listen',
 'id': 'n41004003001',
 'lemma': 'ἀκούω',
 'morph': 'V-PAM-2P',
 'pos': 'verb',
 'strongs': 'G0191',
 'text': 'Ἀκούετε'}


In [8]:
# display the word data for Mark 4:3 (41004003*): gloss, lemma, part of speech
for identifier, word in src.items():
    if identifier.startswith("n41004003"):
        word.display()

n41004003001: Ἀκούετε		 (Listen, ἀκούω, verb)
n41004003002: ἰδοὺ		 (Behold, ἰδού, ptcl)
n41004003003: ἐξῆλθεν		 (went out, ἐξέρχομαι, verb)
n41004003004: ὁ		 (the [one], ὁ, det)
n41004003005: σπείρων		 (sowing, σπείρω, verb)
n41004003006: σπεῖραι		 (to sow, σπείρω, verb)


## Loading Target Data

In [9]:
from bible_alignments.burrito import TargetReader

trg = TargetReader(DATAPATH / "targets/eng/nt_BSB.tsv", idheader="identifier")

print(f"{len(trg)} words for {trg.identifier}.")

201087 words for nt_BSB.


In [10]:
# display the text for Mark 4:3
" ".join([term.text for termid, term in trg.items() if termid.startswith("41004003")])

'“ Listen ! A farmer went out to sow his seed .'

In [11]:
# display the word data for Mark 4:3 (41004003*)
# the additional parenthesized data indicates:
# - the translation type (empty for this data)
# - whether the token is punctuation: this data looks unreliable
# - whether the token is the primary item for the alignment
for identifier, word in trg.items():
    if identifier.startswith("41004003"):
        word.display()

410040030011: “		 ('', True, False)
410040030021: Listen		 ('', False, True)
410040030031: !		 ('', True, False)
410040030041: A		 ('', False, True)
410040030051: farmer		 ('', False, True)
410040030061: went		 ('', False, True)
410040030071: out		 ('', False, True)
410040030081: to		 ('', False, True)
410040030091: sow		 ('', False, True)
410040030101: his		 ('', False, True)
410040030111: seed		 ('', False, True)
410040030121: .		 ('', True, False)


## Loading Alignments

Loading an alignment file requires specifying the source, target, language, and process. In this example the identifier `SBLGNT-BSB-manual` names the source, target, and process.

In [12]:
from bible_alignments.burrito import Catalog, Manager

# define the alignment set: language, source, target, and alignment type
alset = Catalog().get_alignmentset(language="eng", identifier="SBLGNT-BSB-manual")
mgr = Manager(alset)



        - root: /Users/sboisen/git/Clear-Bible/Alignments/data
        - source: sources/SBLGNT.tsv
        - target: targets/eng/nt_BSB.tsv
        - alignments: alignments/eng/BSB/SBLGNT-BSB-manual.json
        
Dropping 487 bad alignment records. Instances in self.malaligned.


The alignment data includes the source corpus data.

In [13]:
mgr.sourceitems["n41004003001"].display()

n41004003001: Ἀκούετε		 (Listen, ἀκούω, verb)


... and target corpus data.

In [14]:
# no leading n, includes word part '1'
mgr.targetitems["410040030021"].display()

410040030021: Listen		 ('', False, True)


The individual alignments are in `AlignmentRecord` instances, part of the `AlignmentGroup`. 

In [15]:
print(f"Source id: {mgr.alset.sourceid}")
print(f"Target id: {mgr.alset.targetid}")
print(f"# of alignment records: {len(mgr.alignmentgroup.records)}")

Source id: SBLGNT
Target id: BSB
# of alignment records: 115941


The alignment data includes the source and target data.

## Example Applications

### Displaying Aligned Source and Target

One application for this data is reviewing the alignments for a passage.

This example displays the aligned source and target tokens for Mark 4:3. It shows that alignments are not always one-to-one: two Greek words can align to a single English word, and one Greek word can align to several English words.

In [16]:
mgr["41004003"].display()

------------
Source: n41004003001: Ἀκούετε		 (Listen, ἀκούω, verb)
Source: n41004003002: ἰδοὺ		 (Behold, ἰδού, ptcl)
Target: 410040030021: Listen		 ('', False, True)
------------
Source: n41004003003: ἐξῆλθεν		 (went out, ἐξέρχομαι, verb)
Target: 410040030061: went		 ('', False, True)
Target: 410040030071: out		 ('', False, True)
------------
Source: n41004003004: ὁ		 (the [one], ὁ, det)
Target: 410040030041: A		 ('', False, True)
------------
Source: n41004003005: σπείρων		 (sowing, σπείρω, verb)
Target: 410040030051: farmer		 ('', False, True)
------------
Source: n41004003006: σπεῖραι		 (to sow, σπείρω, verb)
Target: 410040030081: to		 ('', False, True)
Target: 410040030091: sow		 ('', False, True)
Target: 410040030101: his		 ('', False, True)
Target: 410040030111: seed		 ('', False, True)


A pandas DataFrame provides an easy way to compare all the alignments as a matrix. We use 'G' to indicate **G**old standard data. Note that the English punctuation is not aligned in the gold standard data.

In [17]:
mgr["41004003"].dataframe(hitmark="G")

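| | “ | Listen | ! | A | farmer | went | out | to | sow | his | seed | . |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ἀκούετε | | G | | | | | | | | | | |
| ἰδοὺ | | G | | | | | | | | | | |
| ἐξῆλθεν | | | | | | G | G | | | | | |
| ὁ | | | | G | | | | | | | | |
| σπείρων | | | | | G | | | | | | | |
| σπεῖραι | | | | | | | | G | G | G | G | |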
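
To make the matrix layout concrete, the sketch below rebuilds it in plain Python from the Mark 4:3 alignment groups displayed earlier. This is not how `dataframe()` is implemented, which this notebook does not show. The word lists and position pairs are transcribed by hand from the output above, and `groups` and `matrix` are hypothetical names.

```python
# Tokens for Mark 4:3, in text order, transcribed from the output above.
sources = ["Ἀκούετε", "ἰδοὺ", "ἐξῆλθεν", "ὁ", "σπείρων", "σπεῖραι"]
targets = ["“", "Listen", "!", "A", "farmer", "went", "out",
           "to", "sow", "his", "seed", "."]

# Each group pairs 0-based source positions with 0-based target positions.
groups = [
    ([0, 1], [1]),            # Ἀκούετε, ἰδοὺ -> Listen
    ([2], [5, 6]),            # ἐξῆλθεν -> went out
    ([3], [3]),               # ὁ -> A
    ([4], [4]),               # σπείρων -> farmer
    ([5], [7, 8, 9, 10]),     # σπεῖραι -> to sow his seed
]

# One row per source token, one column per target token.
matrix = [["" for _ in targets] for _ in sources]
for source_positions, target_positions in groups:
    for s in source_positions:
        for t in target_positions:
            matrix[s][t] = "G"

# Print each row, using "-" for unaligned cells.
for word, row in zip(sources, matrix):
    print(word, " ".join(cell or "-" for cell in row))
```

The columns for the three punctuation tokens stay empty, matching the note that English punctuation is not aligned in the gold standard data.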


### Displaying Concorded Alignments for a Term

Another application is seeing all the ways a particular data item is aligned throughout the corpus. 

This example collects the BSB alignments for ἀκούω "to hear" and displays the most common ones.

In [23]:
# source tokens
akouete = mgr.sourceitems.term_tokens(term="ἀκούετε")
[token.bcv for token in akouete]

['40010027',
 '40011004',
 '40013017',
 '40017005',
 '41004024',
 '41008018',
 '41009007',
 '42008018',
 '42009035',
 '42010024',
 '43008047',
 '43010020',
 '43014024',
 '44002033',
 '44019026',
 '48004021',
 '50001030']
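
A list like this can be summarized with `collections.Counter`. As a small sketch, the code below tallies the verse identifiers from the output above by their two-digit book prefix (40 is Matthew, following the BBCCCVVVWWWP convention). The list is copied by hand from that output, and `per_book` is a hypothetical name.

```python
from collections import Counter

# Verse identifiers copied from the output above.
verse_ids = [
    "40010027", "40011004", "40013017", "40017005",
    "41004024", "41008018", "41009007",
    "42008018", "42009035", "42010024",
    "43008047", "43010020", "43014024",
    "44002033", "44019026",
    "48004021",
    "50001030",
]

# The first two digits of each identifier are the book number.
per_book = Counter(verse_id[:2] for verse_id in verse_ids)

# most_common() sorts by count, highest first.
for book, count in per_book.most_common():
    print(book, count)
```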