# Extract Hebrew to LXX verb mappings from the CATSS dataset

The CATSS parallel dataset has been processed into JSON files, available at:
https://github.com/codykingham/CATSS_parsers

The dataset is not yet free of all parsing errors, but more than 99% of its
lines have been parsed successfully.

We will pick through this dataset to select alignments of interest. At the
moment we are most interested in verb alignments, so we focus on the verbs.

## Making the connections

Several connections need to be made to successfully retrieve the Greek verb
aligned to a Hebrew verb:

* The Hebrew word in the parallel dataset needs to string-match its BHSA
equivalent (this requires some normalizing). The match is made within a verse.
* The verse of the Greek word needs to be cross-referenced to its LXX verse
reference. We can use the Copenhagen Alliance's
[verse mappings](https://github.com/Copenhagen-Alliance/versification-specification/tree/master/versification-mappings/standard-mappings) to
accomplish this. The parallel dataset has its own way of indicating where the
versification of Rahlfs differs from the BHS, but at the moment I do not
trust its reliability.
* The string of the Greek word within the selected parallel verse needs to match
the string of the word in the morphology dataset.
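The normalizing mentioned in the first point can be sketched as follows. This is only an illustration: `strip_points` is a hypothetical helper, and the matching loop in this notebook relies on the BHSA feature `g_cons_utf8` rather than on a function like this. The sketch assumes that the parallel dataset keeps consonants plus shin/sin dots (as in the CATSS form `בראשׁית`) and drops vowel points and accents.

```python
SHIN_DOT, SIN_DOT = '\u05C1', '\u05C2'

def strip_points(word):
    """Keep Hebrew consonants and shin/sin dots; drop vowels and accents."""
    return ''.join(
        ch for ch in word
        if '\u05D0' <= ch <= '\u05EA' or ch in (SHIN_DOT, SIN_DOT)
    )

# pointed form: bet + dagesh + qamets, resh + qamets, alef
pointed = '\u05D1\u05BC\u05B8\u05E8\u05B8\u05D0'
print(strip_points(pointed))
```

Comparing on a reduced form like this trades precision for recall: two differently pointed words with the same consonants will match, which is why the match is restricted to a single verse.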

If all three connections succeed, we have a match. Each match is
stored in a dictionary that maps the BHSA node number of the Hebrew verb
to all of the word data for the matching Greek verb.

```
{
    3: {
        "utf8": "ἐποίησεν",
        "trans": "E)POI/HSEN",
        "typ": "verb",
        "styp": "VAI", 
        "lexeme": "POIE/W",
        "morph_code": "VAI.AAI3S",
        "number": "sg",
        "tense": "aorist",
        "voice": "active",
        "mood": "indc",
        "person": "3"
    },
}
```

The BHSA node number can then be used in Text-Fabric to retrieve data on the Hebrew word:

```
T.text(3) == 'בָּרָא'
F.sp.v(3) == 'verb'
```

In [37]:
import sys
import json
import regex
import collections
from tf.app import use
from pathlib import Path
import pandas as pd

sys.path.append('../../scripts/hebrew')
from positions import PositionsTF

# configure output files
BHSA2LXX = Path('../../_private_/verb_data/bhsa2lxx.json')
LXXVERSES = Path('../../_private_/verb_data/lxx_verses.csv')

# get data locations for LXX and alignments
github_dir = Path.home().joinpath('github')
catss_repo = github_dir.joinpath('codykingham/CATSS_parsers')
para_dir = catss_repo.joinpath('JSON/parallel') 
morph_dir = catss_repo.joinpath('JSON/morphology')

# load the LXX data
para_data = [
    json.loads(file.read_text()) 
        for file in sorted(para_dir.glob('*.par.json'))
]
morph_data = [
    json.loads(file.read_text())
        for file in sorted(morph_dir.glob('*.mlxx.json'))
]

# get versification map for LXX and load it
lxx_verse_map_file = github_dir.joinpath(
    'copenhagen_alliance/versification-specification',
    'versification-mappings/standard-mappings/lxx.json',
)
lxx_verse_map = json.loads(lxx_verse_map_file.read_text())['mappedVerses']

# load Text-Fabric (for Hebrew BHSA data)
# and assign short-form variables for easy access to its methods
bhsa = use('bhsa')
api = bhsa.api
F, E, T, L = api.F, api.E, api.T, api.L

In [4]:
para_data[0][0]

['GEN 1:1',
 [[['בראשׁית', []]], [], [['ἐν', []], ['ἀρχῇ', []]]],
 [[['ברא', []]], [], [['ἐποίησεν', []]]],
 [[['אלהים', []]], [], [['ὁ', []], ['θεὸς', []]]],
 [[['את', []], ['השׁמים', []]], [], [['τὸν', []], ['οὐρανὸν', []]]],
 [[['ואת', []], ['הארץ', []]], [], [['καὶ', []], ['τὴν', []], ['γῆν', []]]]]

In [13]:
morph_data[0][0][:3]

['GEN 1:1',
 {'utf8': 'ἐν',
  'trans': 'E)N',
  'typ': 'prep',
  'styp': 'P',
  'lexeme': 'E)N',
  'morph_code': 'P'},
 {'utf8': 'ἀρχῇ',
  'trans': 'A)RXH=|',
  'typ': 'noun',
  'styp': 'N1',
  'lexeme': 'A)RXH/',
  'morph_code': 'N1.DSF',
  'case': 'dat',
  'number': 'sg',
  'gender': 'f'}]

In [14]:
list(lxx_verse_map.items())[:10]

[('DAG 2:1-49', 'DAN 2:1-49'),
 ('DAG 3:1-23', 'DAN 3:1-23'),
 ('DAG 3:91-97', 'DAN 3:24-30'),
 ('DAG 4:1-3', 'DAN 3:31-33'),
 ('DAG 4:4-37', 'DAN 4:1-34'),
 ('DAG 4:1-2', 'DAN 4:4-5'),
 ('DAG 5:1-30', 'DAN 5:1-30'),
 ('DAG 6:1-29', 'DAN 6:1-29'),
 ('DAG 7:1-28', 'DAN 7:1-28'),
 ('DAG 8:1-27', 'DAN 8:1-27')]

## Map Verse References

All of our connections are built on verse references. We map the
verse references to the USX schema:

https://ubsicap.github.io/usx/vocabularies.html

The CATSS parallel and morphology data have already been normalized to USX-style
references. The only exceptions are books such as Joshua and Judges, whose book
codes carry an appended A or B, joined with an underscore (e.g. `JOS_B`).
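A reference with such a suffix can be taken apart with a small regular expression. The sketch below is an assumption-laden illustration, not part of the pipeline: `split_ref` and `REF_PATTERN` are hypothetical names, and the standard-library `re` module is used here so that the example runs on its own.

```python
import re

# book abbreviation, optional _A/_B column suffix, chapter, verse
REF_PATTERN = re.compile(r'^([A-Z0-9]+)(?:_([AB]))? (\d+):(\d+)$')

def split_ref(ref):
    """Split a reference like 'JOS_B 22:34' into its parts."""
    match = REF_PATTERN.match(ref)
    if match is None:
        raise ValueError(f'unrecognized reference: {ref!r}')
    book, column, chapter, verse = match.groups()
    return book, column, int(chapter), int(verse)

print(split_ref('JOS_B 22:34'))
print(split_ref('GEN 1:1'))
```

References without a suffix come back with `None` in the column slot, so the same function can serve both kinds of book.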

The USX verse references have mappings between LXX and MT, provided by the 
Copenhagen Alliance verse mappings:
https://github.com/Copenhagen-Alliance/versification-specification/tree/master/versification-mappings/standard-mappings

We need to modify the versification for Ezra-Nehemiah, since the CATSS dataset
follows the Greek convention of a single book, 2Esdras (Εσδρας β), whereas the
Copenhagen mapping and USX use the Latin division for these books.
For more on this, see https://en.wikipedia.org/wiki/Esdras#Naming_conventions

The mappings we need to make are:
* 2Esdras 1:1-10:44 = Ezra 1:1-10:44
* 2Esdras 11:1-23:31 = Nehemiah 1:1-13:31

NB: we only map relevant books (i.e. those with Hebrew attestations).
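Because Nehemiah simply continues the chapter count of Ezra, the per-chapter entries can also be generated rather than typed out. This is a sketch under one assumption: the chapter lengths listed here are copied from the hand-written table in this notebook, and `esdras_mappings` is a hypothetical helper.

```python
# verses per chapter, copied from the hand-written mapping table
EZRA_LENGTHS = [11, 70, 13, 24, 17, 22, 28, 36, 15, 44]
NEH_LENGTHS = [11, 20, 38, 17, 19, 19, 72, 18, 37, 40, 36, 47, 31]

def esdras_mappings():
    """Build 2ES -> EZR/NEH range mappings, one entry per chapter."""
    mappings = {}
    for ch, n in enumerate(EZRA_LENGTHS, start=1):
        mappings[f'2ES {ch}:1-{n}'] = f'EZR {ch}:1-{n}'
    offset = len(EZRA_LENGTHS)  # Nehemiah 1 is 2Esdras 11
    for ch, n in enumerate(NEH_LENGTHS, start=1):
        mappings[f'2ES {ch + offset}:1-{n}'] = f'NEH {ch}:1-{n}'
    return mappings

print(esdras_mappings()['2ES 11:1-11'])
```

Generating the entries makes the chapter offset explicit and avoids typos in a 23-line table.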

We also create a dictionary, `mt2lxx_verse`, which contains 1-to-1 verse mappings
from an MT verse reference to an LXX reference.

NB: the Copenhagen Alliance verse mappings start from verse 0 in the Psalms, presumably
for the superscriptions. None of our datasets contain these verses, so a verse 0 at
the start of a range is rewritten as verse 1 when the ranges are expanded.



In [15]:
# filter out Ezra/Esdras mappings
lxx_verse_map2 = {
    v1:v2 for v1, v2 in lxx_verse_map.items() 
        if not regex.match('EZR|2ES', v1)
}

# add new mappings
lxx_verse_map2.update({
    '2ES 1:1-11': 'EZR 1:1-11',
    '2ES 2:1-70': 'EZR 2:1-70',
    '2ES 3:1-13': 'EZR 3:1-13',
    '2ES 4:1-24': 'EZR 4:1-24',
    '2ES 5:1-17': 'EZR 5:1-17',
    '2ES 6:1-22': 'EZR 6:1-22',
    '2ES 7:1-28': 'EZR 7:1-28',
    '2ES 8:1-36': 'EZR 8:1-36',
    '2ES 9:1-15': 'EZR 9:1-15',
    '2ES 10:1-44': 'EZR 10:1-44',
    "2ES 11:1-11": "NEH 1:1-11",
    "2ES 12:1-20": "NEH 2:1-20",
    "2ES 13:1-38": "NEH 3:1-38",
    "2ES 14:1-17": "NEH 4:1-17",
    "2ES 15:1-19": "NEH 5:1-19",
    "2ES 16:1-19": "NEH 6:1-19",
    "2ES 17:1-72": "NEH 7:1-72",
    "2ES 18:1-18": "NEH 8:1-18",
    "2ES 19:1-37": "NEH 9:1-37",
    "2ES 20:1-40": "NEH 10:1-40",
    "2ES 21:1-36": "NEH 11:1-36",
    "2ES 22:1-47": "NEH 12:1-47",
    "2ES 23:1-31": "NEH 13:1-31",
})

mt2lxx_versemap = {v:k for k,v in lxx_verse_map2.items()}

def generate_verses(reference):
    """Split a reference range into individual references"""
    reference = reference.replace(':0', ':1') # replace zero verses; don't know why they are there
    book = reference.split()[0]
    ch_vss = reference.split()[1]
    ch = ch_vss.split(':')[0]
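    try:
        vs_start, vs_end = ch_vss.split(':')[1].split('-')
    except (ValueError, IndexError) as err:
        # not a single-chapter "start-end" range; report the offending reference
        raise ValueError(f'cannot parse verse range: {reference!r}') from err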
    refs = []
    for i in range(int(vs_start), int(vs_end)+1):
        refs.append(f'{book} {ch}:{i}')
    return refs

# expand versemap to include every verse in between the ranges
# so that a verse can be converted with a simple dict lookup
mt2lxx_verse = {}

for lxx_vss, mt_vss in lxx_verse_map2.items():
    if '-' in lxx_vss and '-' in mt_vss:
        lxx_refs = generate_verses(lxx_vss)
        mt_refs = generate_verses(mt_vss)
        mt2lxx_verse.update(zip(mt_refs, lxx_refs))
    elif '-' not in lxx_vss and '-' not in mt_vss:
        mt2lxx_verse[mt_vss] = lxx_vss
    else:
        raise ValueError(f'a range is mapped to a single verse: {lxx_vss} -> {mt_vss}')

In [16]:
# e.g.
mt2lxx_verse['PSA 115:1']

'PSA 113:9'
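One thing to keep in mind with range expansion is that `zip` stops at the shorter sequence, so two ranges of unequal length would be paired without any warning. The following sketch adds an explicit length check. The helper names `expand_range` and `pair_ranges` are hypothetical, and the example ranges are taken from the Daniel entries printed earlier.

```python
def expand_range(reference):
    """Expand 'PSA 3:1-3' into ['PSA 3:1', 'PSA 3:2', 'PSA 3:3']."""
    book, ch_vss = reference.split()
    chapter, verses = ch_vss.split(':')
    start, end = (int(v) for v in verses.split('-'))
    return [f'{book} {chapter}:{i}' for i in range(start, end + 1)]

def pair_ranges(lxx_range, mt_range):
    """Pair two ranges verse by verse, refusing ranges of unequal length."""
    lxx_refs = expand_range(lxx_range)
    mt_refs = expand_range(mt_range)
    if len(lxx_refs) != len(mt_refs):
        raise ValueError(f'unequal ranges: {lxx_range} vs {mt_range}')
    return dict(zip(mt_refs, lxx_refs))

pairs = pair_ranges('DAG 3:91-97', 'DAN 3:24-30')
print(pairs['DAN 3:24'])  # DAG 3:91
```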

### Map BHSA verse references to USX abbreviations

In [18]:
bhsa2usx = {
    'Genesis': 'GEN',
    'Exodus': 'EXO',
    'Leviticus': 'LEV',
    'Numbers': 'NUM',
    'Deuteronomy': 'DEU',
    'Joshua': 'JOS_B', # NB: going with B col for now
    'Judges': 'JDG_A', # NB: going with A col for now
    '1_Samuel': '1SA',
    '2_Samuel': '2SA',
    '1_Kings': '1KI',
    '2_Kings': '2KI',
    'Isaiah': 'ISA',
    'Jeremiah': 'JER',
    'Ezekiel': 'EZE',
    'Hosea': 'HOS',
    'Joel': 'JOL',
    'Amos': 'AMO',
    'Obadiah': 'OBA',
    'Jonah': 'JON',
    'Micah': 'MIC',
    'Nahum': 'NAM',
    'Habakkuk': 'HAB',
    'Zephaniah': 'ZEP',
    'Haggai': 'HAG',
    'Zechariah': 'ZEC',
    'Malachi': 'MAL',
    'Psalms': 'PSA',
    'Job': 'JOB',
    'Proverbs': 'PRO',
    'Ruth': 'RUT',
    'Song_of_songs': 'SNG',
    'Ecclesiastes': 'ECC',
    'Lamentations': 'LAM',
    'Esther': 'EST',
    'Daniel': 'DAN',
    'Ezra': 'EZR',
    'Nehemiah': 'NEH',
    '1_Chronicles': '1CH',
    '2_Chronicles': '2CH',
}

### Map data to verse references

Now we map the parallel and morphology data to verse references.
These will serve as the primary basis for building the connections
between the datasets.

In [19]:
verse2para = {}
verse2morph = {}

for dataset, data_dict in [(para_data, verse2para), (morph_data, verse2morph)]:
    for book in dataset:
        for verse in book:
            ref = verse[0]
            verse_data = verse[1:]
            data_dict[ref] = verse_data

In [20]:
verse2para['GEN 1:1'] # Now we can access the data by verse references

[[[['בראשׁית', []]], [], [['ἐν', []], ['ἀρχῇ', []]]],
 [[['ברא', []]], [], [['ἐποίησεν', []]]],
 [[['אלהים', []]], [], [['ὁ', []], ['θεὸς', []]]],
 [[['את', []], ['השׁמים', []]], [], [['τὸν', []], ['οὐρανὸν', []]]],
 [[['ואת', []], ['הארץ', []]], [], [['καὶ', []], ['τὴν', []], ['γῆν', []]]]]

In [21]:
verse2morph['GEN 1:1'][:2]

[{'utf8': 'ἐν',
  'trans': 'E)N',
  'typ': 'prep',
  'styp': 'P',
  'lexeme': 'E)N',
  'morph_code': 'P'},
 {'utf8': 'ἀρχῇ',
  'trans': 'A)RXH=|',
  'typ': 'noun',
  'styp': 'N1',
  'lexeme': 'A)RXH/',
  'morph_code': 'N1.DSF',
  'case': 'dat',
  'number': 'sg',
  'gender': 'f'}]
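Plain dictionary assignment overwrites silently when the same reference occurs twice. Whether the CATSS files actually contain repeated references is not checked in this notebook; the sketch below only shows how such cases could be surfaced, using a hypothetical `index_by_verse` helper and toy data shaped like the morphology files.

```python
def index_by_verse(books):
    """Map each verse reference to its data and collect repeated references."""
    index, duplicates = {}, []
    for book in books:
        for ref, *verse_data in book:
            if ref in index:
                duplicates.append(ref)
            index[ref] = verse_data  # the last occurrence wins
    return index, duplicates

# toy data: one book, two entries sharing a reference
toy_books = [[
    ['GEN 1:1', {'utf8': 'ἐν'}, {'utf8': 'ἀρχῇ'}],
    ['GEN 1:1', {'utf8': 'ἐποίησεν'}],
]]
index, duplicates = index_by_verse(toy_books)
print(duplicates)  # ['GEN 1:1']
```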

## Loop through BHSA candidates and build the connections

In [24]:
missed_lxx_refs = set()

def get_greek_morphology(word, ref):
    """Look up a word's morphology data based on its reference."""
    lxx_ref = mt2lxx_verse.get(ref, ref) # NB the MT to LXX verse mapping
    
    # no parallel in the Greek
    # TODO: double check this status
    if lxx_ref not in verse2morph:
        missed_lxx_refs.add(ref)
        return None
    
    for word_data in verse2morph[lxx_ref]:
        if word_data['utf8'] == word:
            word_data['LXX_verse'] = lxx_ref
            return word_data

In [25]:
# e.g.
ref = 'GEN 1:1'
get_greek_morphology(verse2para[ref][0][2][1][0], ref)

{'utf8': 'ἀρχῇ',
 'trans': 'A)RXH=|',
 'typ': 'noun',
 'styp': 'N1',
 'lexeme': 'A)RXH/',
 'morph_code': 'N1.DSF',
 'case': 'dat',
 'number': 'sg',
 'gender': 'f',
 'LXX_verse': 'GEN 1:1'}
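Note that the lookup returns the first word in the verse with a matching string and adds the `LXX_verse` key to the stored entry itself. If the stored data should stay untouched, a copy can be returned instead. A minimal sketch, assuming a verse index shaped like `verse2morph` (the name `lookup_word` and the toy index are hypothetical):

```python
def lookup_word(word, lxx_ref, verse_index):
    """Return a copy of the first entry in the verse whose text equals `word`."""
    for word_data in verse_index.get(lxx_ref, []):
        if word_data['utf8'] == word:
            # build a new dict so the indexed entry is left untouched
            return {**word_data, 'LXX_verse': lxx_ref}
    return None

toy_index = {
    'GEN 1:1': [
        {'utf8': 'ἐν', 'typ': 'prep'},
        {'utf8': 'ἀρχῇ', 'typ': 'noun'},
    ]
}
target = toy_index['GEN 1:1'][1]['utf8']
result = lookup_word(target, 'GEN 1:1', toy_index)
print(result['LXX_verse'])                       # GEN 1:1
print('LXX_verse' in toy_index['GEN 1:1'][1])    # False
```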

In [26]:
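# the chapter is optional so that chapterless references (e.g. 'OBA 15') parse correctly
match_ref = regex.compile(r'([A-Z0-9_]+) (?:(\d+):)?(\d+)')

def update_verse(verse_ref, tc_notes):
    """Iterate through text-critical notation and update verse if necessary."""
    book, chapter, verse = match_ref.match(verse_ref).groups()
    for note in tc_notes:
        # a numeric note gives the verse number to use instead
        if note.isnumeric():
            verse = note
            break
    # keep chapterless references chapterless
    if chapter is None:
        return f'{book} {verse}'
    return f'{book} {chapter}:{verse}'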

In [27]:
bhsa2lxx = {}
missed = []
bad_mtverses = set()

in_clause = lambda node1, node2: node2 in L.d(L.u(node1,'clause')[0], 'word')

# begin making the connections
for verb in F.pdp.s('verb'):
    
    P = PositionsTF(verb, 'clause', api).get
    
    # gather various candidates to attempt to match with the 
    # Hebrew parallel data stored in the LXX parallels dataset
    possible_nodes = [(verb,)]
    
    # handle attached elements to normalize with LXX
    if P(-1, 'lex') == 'W':
        bhsa_nodes = (verb-1, verb)
        possible_nodes.append(bhsa_nodes)
    elif F.vt.v(verb).startswith('inf') and P(-1, 'pdp') == 'prep':        
        bhsa_nodes = (verb-1, verb)
        possible_nodes.append(bhsa_nodes)
        if P(-2,'lex') == 'W':
            bhsa_nodes = (P(-2),) + bhsa_nodes
            possible_nodes.append(bhsa_nodes)
    elif F.vt.v(verb).startswith('ptc') and P(-1, 'lex') == 'H':
        bhsa_nodes = (verb-1, verb)
        possible_nodes.append(bhsa_nodes)
        
    # assemble the strings
    possible_strings = set()
    for pnn in possible_nodes:
        possible_strings.add(''.join(F.g_cons_utf8.v(n) for n in pnn))
    
    book, chapter, verse = T.sectionFromNode(verb)
    usx_book = bhsa2usx[book]
    if usx_book == 'OBA':
        mt_ref = f'{usx_book} {verse}'
    else:
        mt_ref = f'{usx_book} {chapter}:{verse}'
    
    # attempt link to parallel data line on basis of Hebrew text
    link = {}
    try:
        para_cols = verse2para[mt_ref]
    except KeyError:
        bad_mtverses.add(mt_ref)
        continue
        
    for heb_colA, heb_colB, grk_col in para_cols:
        
        if not heb_colA or heb_colA[0] == 'PARSING_ERROR':
            continue
        
        # attempt Hebrew connection to identify the right data line
        heb_match = None
        for word, tc_note in heb_colA:
            if word in possible_strings:
                heb_match = word
                break
          
        if not heb_match:
            continue
            
        # we have the right data line
        # now attempt to make Greek connection
        greek_match = None
        if heb_match:
            for greek_word, tc_note in grk_col:
                
                # modify reference
                lxx_ref = update_verse(mt_ref, tc_note)
                morph = get_greek_morphology(greek_word, lxx_ref)
                if morph and morph['typ'] == 'verb':
                    greek_match = morph
                    break
                    
        # we've made a connection!
        # save the data and move on
        if greek_match:
            #verb_data = (verb, mt_ref, F.g_cons_utf8.v(verb))
            link[verb] = greek_match
            
    # record links and missed links
    if link:
        bhsa2lxx.update(link)
    else:
        missed.append((verb, mt_ref))
        
print('DONE')
percent_done = round(len(bhsa2lxx) / (len(bhsa2lxx)+len(missed)), 2)*100
percent_missed = round(len(missed) / (len(bhsa2lxx)+len(missed)), 2)*100
print(f'\tlinked: {len(bhsa2lxx)}', f'({percent_done}%)')
print(f'\tmissed: {len(missed)}', f'{percent_missed}%')

DONE
	linked: 62693 (91.0%)
	missed: 6315 9.0%
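The percentages above are rounded to two decimal places before being multiplied by 100, which leaves whole percents only. As a small sketch, Python's `%` format specification does the scaling and rounding in one step and keeps one decimal; the counts are the ones printed above.

```python
linked, missed_count = 62693, 6315  # counts printed above
total = linked + missed_count

# the ".1%" format spec multiplies by 100 and rounds to one decimal
linked_pct = f'{linked / total:.1%}'
missed_pct = f'{missed_count / total:.1%}'
print(f'linked: {linked} ({linked_pct})')
print(f'missed: {missed_count} ({missed_pct})')
```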


In [28]:
bad_mtverses # NB: these need to be fixed in the parallel database

{'DEU 4:25', 'EZE 34:27', 'EZE 44:18', 'JOS_B 22:34'}

In [29]:
#verse2morph['GEN 3:17']

In [30]:
for verb, ref in missed[:0]:
    print(verb)
    print(F.g_cons_utf8.v(verb))
    print(ref)
    for line in verse2para[ref]:
        print('\t', line)
    print()

In [35]:
# Export LXX Data JSON
with open(BHSA2LXX, 'w', encoding='utf-8') as out:
    json.dump(bhsa2lxx, out, ensure_ascii=False, indent=2)
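Keep in mind when reading this file back that JSON object keys are always strings, so the integer node numbers are written as `"3"` rather than `3`. A short sketch of the round trip, with a toy dictionary standing in for `bhsa2lxx`:

```python
import json

links = {3: {'utf8': 'ἐποίησεν', 'typ': 'verb'}}

# JSON object keys are always strings, so 3 is written as "3"
text = json.dumps(links, ensure_ascii=False)
loaded = json.loads(text)

# convert the keys back to integers before using them as node numbers
restored = {int(node): data for node, data in loaded.items()}
print(list(loaded), list(restored))
```

Converting the keys right after loading keeps lookups such as `T.text(node)` working with integer nodes.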

In [46]:
# Assemble and export plain text verse data for LXX in a CSV
rows = []
for verse, words in verse2morph.items():
    row = {}
    row['LXX_verse'] = verse
    row['text'] = ' '.join(w['utf8'] for w in words)
    rows.append(row)
    
df = pd.DataFrame(rows)

df.head()

|   | LXX_verse | text |
|---|-----------|------|
| 0 | GEN 1:1 | ἐν ἀρχῇ ἐποίησεν ὁ θεὸς τὸν οὐρανὸν καὶ τὴν γῆν |
| 1 | GEN 1:2 | ἡ δὲ γῆ ἦν ἀόρατος καὶ ἀκατασκεύαστος καὶ σκότ... |
| 2 | GEN 1:3 | καὶ εἶπεν ὁ θεός γενηθήτω φῶς καὶ ἐγένετο φῶς |
| 3 | GEN 1:4 | καὶ εἶδεν ὁ θεὸς τὸ φῶς ὅτι καλόν καὶ διεχώρισ... |
| 4 | GEN 1:5 | καὶ ἐκάλεσεν ὁ θεὸς τὸ φῶς ἡμέραν καὶ τὸ σκότο... |


In [47]:
df.shape

(30609, 2)

In [48]:
df.to_csv(LXXVERSES, index=False)