# Build English Verb Dataset

In this notebook, we build a translation dataset based on the NIV and ESV translation
alignments provided by GBI. The GBI data, which uses an underlying WLC Hebrew text,
has already been aligned to the Amsterdam BHSA Hebrew dataset in 
[GBI_alignment_wrangling.ipynb](GBI_alignment_wrangling.ipynb). We can thus take
advantage of both databases and their associated data when building our dataset here.

In the dataset we'll attempt to parse the English text so that the syntax and (especially)
the verbal forms can be analyzed alongside the Hebrew grammar. We'll start out with 
Spacy for the English parsings. 

In [1]:
import sys
import re
import json
import collections
import pandas as pd
pd.set_option('display.max_rows', 200)
from pathlib import Path
from tf.app import use
from bidict import bidict # bidirectional dictionary

# custom modules 
sys.path.append('..')
import tf_tools
from gbi_functions import id2ref
from positions import PositionsTF

# organize pathways
PROJ_DIR = Path.home().joinpath('github/CambridgeSemiticsLab/translation_traditions_HB')
PRIVATE_DATA = PROJ_DIR.joinpath('data/_private_')
GBI_DATA_DIR = PRIVATE_DATA.joinpath('GBI_alignment')
VERB_DIR = PRIVATE_DATA.joinpath('verb_data')

# load GBI data
gbi_niv = json.loads(GBI_DATA_DIR.joinpath('niv84.ot.alignment.json').read_text())
gbi_esv = json.loads(GBI_DATA_DIR.joinpath('esv.ot.alignment.json').read_text())

# load BHSA / GBI Alignment
bhsa2gbi = json.loads(GBI_DATA_DIR.joinpath('bhsa2gbi.json').read_text())

# load BHSA data and methods
bhsa = use('bhsa')
api = bhsa.api
F, E, T, L = api.F, api.E, api.T, api.L

In [2]:
# set up some dictionaries for convenient word data access
# see 'Dict demos' in next cells for quick intro to the resulting dict structures

sources = (('niv', gbi_niv), ('esv', gbi_esv))
word_data = collections.defaultdict(lambda: collections.defaultdict(list))
verse2words = collections.defaultdict(lambda: collections.defaultdict(list)) 
linkbyid = collections.defaultdict(list) # list of 2-tuples, each containing word IDs
id2link = collections.defaultdict(dict) # select a link based on a single ID 

for name, source in sources:
    for verse in source:
        
        # unpack words for processing
        trans_words =  verse['translation']['words']
        manu_words = verse['manuscript']['words']
        
        # map translation word data
        for w in trans_words:
            ref_tuple = id2ref(w['id'], 'translation')
            verse2words[name][ref_tuple].append(w['id'])
            word_data[name][w['id']] = w
        
        # map WLC word data
        # arbitrarily use the copy stored under NIV
        if name == 'niv':
            for w in manu_words:
                ref_tuple = id2ref(w['id'])
                verse2words['wlc'][ref_tuple].append(w['id'])
                word_data['wlc'][w['id']] = w
                
        # map links to word ids
        # the alignment data just contains indices pointing
        # to the various lists, so these have to be used to 
        # identify the specific word in question
        for wlc_indices, trans_indices in verse['links']:
            wlc_ids = tuple(manu_words[i]['id'] for i in wlc_indices)
            trans_ids = tuple(sorted(trans_words[i]['id'] for i in trans_indices))
            linkbyid[name].append((wlc_ids, trans_ids))
            for wid in wlc_ids:
                id2link[name][wid] = trans_ids

### Dict demos

In [3]:
word_data['wlc'][10010010021]

{'id': 10010010021,
 'altId': 'בָּרָ֣א\u200e-1',
 'text': 'בָּרָ֣א\u200e',
 'strong': 'H1254',
 'gloss': 'he created',
 'gloss2': '创造',
 'lemma': 'ברא_1',
 'pos': 'verb',
 'morph': 'vqp3ms'}

In [4]:
word_data['niv'][1001001005]

{'id': 1001001005,
 'altId': 'created-1',
 'text': 'created',
 'transType': 'k',
 'isPunc': False,
 'isPrimary': True}

In [5]:
verse2words['niv'][('Genesis', 1, 1)]

[1001001001,
 1001001002,
 1001001003,
 1001001004,
 1001001005,
 1001001006,
 1001001007,
 1001001008,
 1001001009,
 1001001010,
 1001001011]
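
The word IDs in these demos look positional. The following is a minimal sketch of how such an ID could be split into its parts. The digit layout (2 digits for the book, 3 for the chapter, 3 for the verse, 3 for the word, plus one trailing digit for the word part on the manuscript side) is an assumption inferred from the sample IDs shown here; `decode_gbi_id` is a hypothetical helper and is not the `id2ref` function imported above, which also resolves book names.

```python
def decode_gbi_id(word_id, source='translation'):
    """Split a GBI-style word ID into numeric parts (assumed digit layout)."""
    # assumed widths: 11 digits for translation IDs, 12 for manuscript IDs,
    # once the leading zero of a single-digit book number is restored
    width = 12 if source == 'manuscript' else 11
    digits = str(word_id).zfill(width)
    parts = (
        int(digits[0:2]),   # book number
        int(digits[2:5]),   # chapter
        int(digits[5:8]),   # verse
        int(digits[8:11]),  # word position in the verse
    )
    if source == 'manuscript':
        parts += (int(digits[11]),)  # word part (assumed meaning)
    return parts


print(decode_gbi_id(1001001005))
print(decode_gbi_id(10010010021, source='manuscript'))
```

Under this reading, `1001001005` is the fifth word of book 1, chapter 1, verse 1, which agrees with the position of "created" in the NIV word list above.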

In [6]:
for wlc_ids, trans_ids in linkbyid['niv']:
    if 10010010021 in wlc_ids:
        print(wlc_ids, trans_ids)

(10010010021,) (1001001005,)


In [7]:
# NB: similar to above, however
# the link is only 1-to-X
# so some parts of the left side of the link could be missing
id2link['niv'][10010010021]

(1001001005,)
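
To make the two link structures concrete, here is a small self-contained sketch of the same index-to-ID resolution on a made-up verse. The IDs and the record below are hypothetical; only the shape (`manuscript`, `translation`, `links`) follows the GBI records used above.

```python
# hypothetical miniature record in the shape of one GBI verse
verse = {
    'manuscript': {'words': [{'id': 101}, {'id': 102}, {'id': 103}]},
    'translation': {'words': [{'id': 11}, {'id': 12}, {'id': 13}, {'id': 14}]},
    # each link pairs manuscript indices with translation indices
    'links': [[[0], [1, 0]], [[1, 2], [3]]],
}


def resolve_links(verse):
    """Turn index-based links into ID-based links."""
    manu = verse['manuscript']['words']
    trans = verse['translation']['words']
    link_pairs = []   # like linkbyid: full many-to-many links
    id_to_link = {}   # like id2link: one manuscript ID -> translation IDs
    for manu_idx, trans_idx in verse['links']:
        manu_ids = tuple(manu[i]['id'] for i in manu_idx)
        trans_ids = tuple(sorted(trans[i]['id'] for i in trans_idx))
        link_pairs.append((manu_ids, trans_ids))
        for manu_id in manu_ids:
            id_to_link[manu_id] = trans_ids
    return link_pairs, id_to_link


link_pairs, id_to_link = resolve_links(verse)
print(link_pairs)
print(id_to_link[102])
```

The second lookup shows the 1-to-X caveat: word `102` maps to its translation words, but the fact that it shares the link with `103` is only visible in `link_pairs`.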

## Sub-sample verbs

The dataset is built around verbs, and the starting point is the BHSA syntax data. We
therefore assemble the dataset verb by verb, following the BHSA.

First we get a one-to-one verb mapping between BHSA and GBI Hebrew (WLC). We can then use
the GBI Hebrew links to select the corresponding words in the translations.

The main stipulation for the selection is agreement between BHSA and WLC on 
the classification of a word as a verb (e.g. טוֹב). For the BHSA classification, 
we use the contextual, phrase-dependent part of speech (`pdp`), which records whether 
the word is behaving as a verb in context (e.g. participles).

Since some alignments between BHSA and WLC are `many-to-N` or `N-to-many`, 
we also filter the non-verbal words out of these links. This leaves us with
a 1-to-1 alignment, so that 1 verb in BHSA equals 1 verb in WLC.
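
The selection logic can be sketched on toy data as follows. This is only an illustration under simplified assumptions: words are plain `(id, part_of_speech)` pairs with made-up IDs, and `classify_link` is a hypothetical helper rather than part of the notebook.

```python
def classify_link(bhsa_words, wlc_words):
    """Classify one BHSA/WLC alignment link.

    Each word is a (word_id, part_of_speech) pair. Returns a
    (status, payload) pair; status is 'skip', 'disagree', or 'match'.
    """
    bhsa_verbs = [wid for wid, pos in bhsa_words if pos == 'verb']
    wlc_verbs = [wid for wid, pos in wlc_words if pos == 'verb']

    if not bhsa_verbs and not wlc_verbs:
        return ('skip', None)        # no verb on either side
    if not bhsa_verbs or not wlc_verbs:
        return ('disagree', None)    # only one source sees a verb
    if len(bhsa_verbs) == 1 and len(wlc_verbs) == 1:
        return ('match', (bhsa_verbs[0], wlc_verbs[0]))
    raise ValueError('link still contains more than one verb per side')


links = [
    ([(1, 'prep'), (2, 'verb')], [(901, 'prep'), (902, 'verb')]),
    ([(3, 'subs')], [(903, 'verb')]),
    ([(4, 'subs')], [(904, 'noun')]),
]
results = [classify_link(bhsa, wlc) for bhsa, wlc in links]
print(results)
```

The first link is many-to-many at the word level but still yields a single verb pair once the preposition is filtered out.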

In [8]:
# track where BHSA and WLC disagree on the classification of a verb
no_match = {
    'disagree': [],
}

verb_bhsa2gbi = {}

# find 1-to-1 matches of BHSA and WLC verbs
for bhsa_nodes, gbi_ids in bhsa2gbi:
    
    # filter out non-verbs from the links
    bhsa_verbs = [w for w in bhsa_nodes if F.pdp.v(w) == 'verb'] 
    wlc_verbs = [w for w in gbi_ids if word_data['wlc'][w]['pos'] == 'verb']
    data = (T.text(bhsa_nodes), T.sectionFromNode(bhsa_nodes[0]), bhsa_nodes, gbi_ids) # track null matches
    
    # one case, Jer 51:3, has a double verb mapping caused by 
    # ידרך ידרך, which BHSA maps to a single word node, and gbi 
    # keeps as 2 words; we disambig that here and keep only 
    # first gbi word
    if bhsa_verbs and bhsa_verbs[0] == 262780:
         wlc_verbs = wlc_verbs[:1]
    
    # skip non-verbal contexts
    if not bhsa_verbs + wlc_verbs:
        continue
    
    # track disagreements between 2 sources
    elif (bhsa_verbs and not wlc_verbs) or (wlc_verbs and not bhsa_verbs):
        no_match['disagree'].append(data)
    
    # store result both ways: bhsa 2 wlc, wlc 2 bhsa
    elif len(bhsa_verbs) == 1 and len(wlc_verbs) == 1:
        
        # make a subset selection of verbs
        bhsa_verb, wlc_verb = bhsa_verbs[0], wlc_verbs[0]
        verb_bhsa2gbi[bhsa_verb] = word_data['wlc'][wlc_verb]
        
    
    # or there's a problem...
    else:
        raise Exception(f'Misalignment at {data}')
        
print(sum(len(v) for v in no_match.values()), 'verbs do not match requirements')
print(len(verb_bhsa2gbi), 'selected for building dataset')

4427 verbs do not match requirements
68826 selected for building dataset


Note that the ~4.4k verbs in disagreement result from our use of contextual parts of speech
from the BHSA dataset. The GBI dataset does not seem to be as sensitive to context for
part of speech. A large proportion of these cases are participles used as nouns rather than verbs.

In [9]:
no_match['disagree'][500:510]

[('פְּקֻדֵיהֶ֖ם ', ('Numbers', 1, 39), [70153], [40010390011, 40010390012]),
 ('יֹצֵ֥א ', ('Numbers', 1, 40), [70183], [40010400141]),
 ('פְּקֻדֵיהֶ֖ם ', ('Numbers', 1, 41), [70185], [40010410011, 40010410012]),
 ('יֹצֵ֥א ', ('Numbers', 1, 42), [70214], [40010420141]),
 ('פְּקֻדֵיהֶ֖ם ', ('Numbers', 1, 43), [70216], [40010430011, 40010430012]),
 ('פְּקֻדִ֡ים ', ('Numbers', 1, 44), [70229], [40010440022]),
 ('פְּקוּדֵ֥י ', ('Numbers', 1, 45), [70250], [40010450031]),
 ('פְּקֻדִ֔ים ', ('Numbers', 1, 46), [70271], [40010460032]),
 ('פְקֻדֵיהֶ֑ם ', ('Numbers', 2, 4), [70477], [40020040022, 40020040023]),
 ('פְקֻדָ֑יו ', ('Numbers', 2, 6), [70502], [40020060022, 40020060023])]

The selected verbs (`verb_bhsa2gbi`) are exported at the end of the notebook for later processing.

## Parsing English with Spacy

For understanding the basics of Spacy, see:
https://spacy.io/usage/linguistic-features

For each verb in the `verb_bhsa2gbi` dictionary, we retrieve its verse text in a
given translation. The translated text is parsed by Spacy, which supplies us with
a dependency tree, parts of speech, and verb tags for the English side of things.

For Spacy tags used with the model of choice, see:
https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-2.3.1

In [10]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Token, Span
from spacy.util import filter_spans # filter overlaps; nice tip: https://stackoverflow.com/a/63303480/8351428
from spacy.gold import align # align different tokenizations: https://spacy.io/usage/linguistic-features#aligning-tokenization

### Build links between BHSA and English translations via WLC

In [11]:
# build dataset

english_verbs = collections.defaultdict(dict)
not_linked = collections.defaultdict(list)

for trans in ('esv', 'niv'):
    
    for bhsa_node, wlc_word in verb_bhsa2gbi.items():
        try:
            english_verbs[trans][bhsa_node] = id2link[trans][wlc_word['id']]
        except KeyError: # the WLC word has no link in this translation
            not_linked[trans].append(wlc_word)
        
    print(f'n-verbs selected for {trans}: {len(english_verbs[trans])}')
    print(f'\tn-verbs unlinked: {len(not_linked[trans])}')

n-verbs selected for esv: 66837
	n-verbs unlinked: 1989
n-verbs selected for niv: 65455
	n-verbs unlinked: 3371
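
As a quick sanity check on these counts, the linked and unlinked verbs should add up to the 68826 selected verbs for each translation. The short sketch below only re-uses the numbers printed above and computes the share of verbs that have an English link.

```python
n_selected = 68826
linked = {'esv': 66837, 'niv': 65455}
unlinked = {'esv': 1989, 'niv': 3371}

coverage = {}
for trans in linked:
    # every selected verb is either linked or unlinked
    assert linked[trans] + unlinked[trans] == n_selected
    coverage[trans] = linked[trans] / n_selected
    print(f'{trans}: {coverage[trans]:.1%} of selected verbs are linked')
```

This comes to roughly 97% for the ESV and 95% for the NIV.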


### Parse English Data within its Verse Context

In [12]:
# to avoid double-parsing verses, we map
# all cases needed to be parsed to a single
# parsed verse

# for each translation, gather the verse labels
# that need to be parsed

versestoparse = collections.defaultdict(set)
for trans in english_verbs:
    for bhsa_node, trans_words in english_verbs[trans].items():
        ref = id2ref(trans_words[0], 'translation')
        versestoparse[trans].add(ref)
        
    print(f'{trans} has {len(versestoparse[trans])} verses to parse')

esv has 21264 verses to parse
niv has 21251 verses to parse


### Apply raw Spacy parsing to all relevant verses

In [13]:
# set up the Spacy processor as well as some customized attributes
nlp = spacy.load('en_core_web_sm')
Span.set_extension('tam_tag', default='', force=True)
Token.set_extension('my_span', default=None, force=True)

In [14]:
def parse_verse(verse_tuple, translation):
    """Parse translation verse with Spacy."""
    word_ids = verse2words[translation][verse_tuple]
    words = [word_data[translation][w] for w in word_ids]
    text = ' '.join(w['text'] for w in words)
    parsed_doc = nlp(text) # magic happens here
    return parsed_doc

def parse_verses(transdict):
    """Iterate through all verses and parse them.
    
    Args:
        transdict: dict with structure of e.g. {'niv': {('Genesis', 1, 1), ...}}
    Returns:
        dict w/ structure of e.g. {'niv': {('Genesis', 1, 1): spacy.tokens.Doc}}
    """
    
    parsed_verses = collections.defaultdict(dict)
    
    bhsa.indent(0, reset=True)
    bhsa.info('Parsing translations...')
    
    for translation, verse_set in transdict.items():
    
        # it takes a long time so we time it
        bhsa.indent(1, reset=True)
        bhsa.info(f'Beginning {translation}...')
        bhsa.indent(2, reset=True)
    
        # parse the verse and put Spacy.Doc in a dict
        for i, ref_tuple in enumerate(verse_set):
            
            if i % 5000 == 0 and i != 0:
                bhsa.info(f'done with verse {i}')
                
            parsed_verses[translation][ref_tuple] = parse_verse(ref_tuple, translation)
            
        bhsa.indent(1)
        bhsa.info('done!')
    
    return parsed_verses

Now we execute the parser for all verses. This will take some time!

In [15]:
import pickle

In [16]:
# toggle here to run fresh parsings
parsed_verses_file = PRIVATE_DATA.joinpath('parsings/parsed_verses.pickle')
if False:
    parsed_verses = parse_verses(versestoparse)
    with open(parsed_verses_file, 'wb') as outfile:
        pickle.dump(parsed_verses, outfile)
else:
    with open(parsed_verses_file, 'rb') as infile:
        parsed_verses = pickle.load(infile)
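
The `if False:` toggle works, but it has to be edited by hand. As one possible alternative, here is a minimal sketch of a cache helper that rebuilds only when the pickle is missing or a refresh is requested. `load_or_build` and `fake_parse` are hypothetical names used for illustration; in the notebook the build step would be `parse_verses(versestoparse)`.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build, refresh=False):
    """Load a pickled result, or build and cache it if needed."""
    cache_path = Path(cache_path)
    if refresh or not cache_path.exists():
        result = build()
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        with open(cache_path, 'wb') as outfile:
            pickle.dump(result, outfile)
        return result
    with open(cache_path, 'rb') as infile:
        return pickle.load(infile)


# demo with a stand-in for the slow parsing step
calls = []

def fake_parse():
    calls.append(1)  # count how often the expensive step runs
    return {'niv': {('Genesis', 1, 1): 'parsed doc placeholder'}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'parsings' / 'parsed_verses.pickle'
    first = load_or_build(path, fake_parse)    # builds and writes the cache
    second = load_or_build(path, fake_parse)   # reads the cache

print(len(calls), first == second)
```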

In [17]:
parsed_verses['niv'][('Genesis', 11, 3)]

They said to each other , “ Come , let’s make bricks and bake them thoroughly . ” They used brick instead of stone , and tar for mortar .

### Set up Matcher rules for advanced TAM tags

Spacy parses raw strings into tags and dependencies. For verbs we are particularly
interested in tense, aspect, and modality (TAM). The default tags are not very informative with 
regard to TAM, but we can derive these labels ourselves by adding some additional rules.

We will use Spacy's Matcher class for this, alongside the parser:

https://spacy.io/usage/rule-based-matching

To-do list of primary English tense constructions, curated from:

https://en.wikipedia.org/wiki/English_verbs#Expressing_tenses,_aspects_and_moods

```
simple present            writes
simple past               wrote
present progressive       is writing
past progressive          was writing
present perfect           has written
past perfect              had written
present perf. progress.   has been writing
past perf. progress.      had been writing
future                    will write
future perfect            will have written
future perf. progress.    will have been writing
```

secondary constructions:

```
imperative               write
future-in-past           would write
do-support               does write
be-going-to future       is going to write
```

Later on, we can consider dividing these constructions up into 3 columns -- 1 each for 
tense, aspect, and modality. If a construction contributes to one of these categories,
the column gets filled. Otherwise it is left empty. 

```
"has been writing"

tense           aspect                 modality
-----           ------                 --------
present         perfect progressive
```
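
The rule names defined in the next cell already carry a parenthesised `(TENSE.ASPECT.MODALITY)` suffix, e.g. `'PRES PERF PROG (PRES.PERF_PROG.IND)'`. A minimal sketch of how that suffix could be split into the three columns is given below. `split_tam` and the `TAM` tuple are hypothetical helpers, and the sketch assumes every tag ends in exactly three dot-separated fields, any of which may be empty.

```python
import re
from typing import NamedTuple


class TAM(NamedTuple):
    tense: str
    aspect: str
    modality: str


# three dot-separated fields inside the final pair of parentheses
TAG_RE = re.compile(r'\(([^.()]*)\.([^.()]*)\.([^.()]*)\)\s*$')


def split_tam(tag):
    """Split a rule name such as 'PAST (PAST..IND)' into a TAM tuple."""
    match = TAG_RE.search(tag)
    if match is None:
        return TAM('', '', '')  # no recognisable suffix
    return TAM(*match.groups())


for tag in ('PRES PERF PROG (PRES.PERF_PROG.IND)', 'PAST (PAST..IND)', 'TO INF (..)'):
    print(tag, '->', split_tam(tag))
```

Empty strings stand for unfilled columns, so a bare infinitive ends up with all three fields empty.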

In [18]:
# a set of rules to match tense-aspect-modality constructions in English
# overlapping results will be filtered out and only the longest matching
# span will be kept

# these patterns can be inserted between verb auxiliaries
# and their heads to represent any number of interrupting
# adverbial modifiers
advb_pronouns = {'TAG': {'IN':['RB', 'PRP']}, 'OP': '*'}
advbs = {'TAG': {'IN':['RB']}, 'OP': '*'}
non_verbs = {'TAG': {'NOT_IN':['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']}, 'OP': '*'}
modal_set = ['let', 'may', 'shall', 'must', 'could', 'can']
    
tam_rules = [
    (
        'PRES (PRES..IND)', 
        [
            {'TAG':{'IN':['VBZ', 'VBP']}, 'DEP': {'NOT_IN': ['aux']}},
        ]
    ),
    (
        'PRES PROG (PRES.PROG.IND)', 
        [
            {'TAG': {'IN':['VBZ', 'VBP']}, 'LEMMA':'be'},
            advb_pronouns,
            {'TAG':'VBG', 'LEMMA': {'NOT_IN':['go']}},
        ]
    ),
    (
        'PRES PERF (PRES.PERF.IND)',
        [
            {'TAG': {'IN': ['VBZ', 'VBP']}, 'LEMMA': {'REGEX': 'have'}},
            advb_pronouns,
            {'TAG': 'VBN', 'DEP': {'NOT_IN': ['aux']}},
        ]
    ),
    (
        'PRES PERF PROG (PRES.PERF_PROG.IND)',
        [
            {'TAG': {'IN': ['VBZ', 'VBP']}, 'LEMMA': 'have'},
            advb_pronouns,
            {'TAG': 'VBN', 'LEMMA': 'be'},
            {'TAG': 'VBG'},
        ]
    ),
    (
        'PAST (PAST..IND)',
        [
            {'TAG': 'VBD', 'DEP': {'NOT_IN':['aux']}},
        ]
    ),
    (
        'PAST PERF (PAST.PERF.IND)',
        [
            {'TAG': {'IN': ['VBD']}, 'LEMMA': 'have'},
            advb_pronouns,
            {'TAG': 'VBN', 'DEP': {'NOT_IN': ['aux']}},
        ]
    ),
    (
        'DO PAST (PAST..IND)', # do-support with 'did'; cf. 'DO PRES' below
        [
            {'TAG': {'IN': ['VBD']}, 'LEMMA': 'do', 'DEP': 'aux'},
            non_verbs,
            {'TAG': 'VB'},
        ]
    ),
    (
        'PAST PERF PROG (PAST.PERF_PROG.IND)',
        [
            {'TAG': {'IN': ['VBD']}, 'LEMMA': 'have'},
            advb_pronouns,
            {'TAG': 'VBN', 'LEMMA': 'be'},
            {'TAG': 'VBG'},
        ]
    ),
    (
        'FUT PERF (FUT.PERF.IND)',
        [
            {'TAG': 'MD', 'LEMMA': 'will'},
            advb_pronouns,
            {'TAG': {'IN': ['VB']}, 'LEMMA': 'have'},
            advb_pronouns,
            {'TAG': 'VBN', 'DEP': {'NOT_IN': ['aux']}},
        ]
    ),
    (
        'FUT PERF PROG (FUT.PERF_PROG.IND)',
        [
            {'TAG': 'MD', 'LEMMA': 'will'},
            advb_pronouns,
            {'TAG': {'IN': ['VB']}, 'LEMMA': 'have'},
            advb_pronouns,
            {'TAG': 'VBN', 'LEMMA': 'be'},
            {'TAG': 'VBG'},
        ]
    ),

    (
        'PAST PROG (PAST.PROG.IND)',
        [
            {'TAG':'VBD', 'LEMMA': {'IN': ['be', 'keep']}},
            advb_pronouns,
            {'TAG': 'VBG'},
        ]
    ),
    (
        'FUT (FUT..IND)',
        [
            {'TAG': 'MD', 'LEMMA': {'REGEX':'[wW]ill'}, 'DEP': 'aux'},
            non_verbs,
            {'TAG': 'VB', 'DEP': {'NOT_IN': ['aux']}},
        ]
    ),
    (
        'FUT-IN-PAST (PAST..SUBJ)', # habitual?
        [
            {'LOWER': 'would', 'DEP': {'IN': ['aux']}},
            advb_pronouns,
            {'TAG':'VB'}
        ]
    ),
    (
        'DO PRES (PRES..IND)',
        [
            {'TAG': {'IN': ['VBZ', 'VBP']}, 'LEMMA': 'do', 'LOWER': {'NOT_IN':['do']}},
            advb_pronouns,
            {'TAG': 'VB'},
        ]
    ),
    (
        'GOING TO (FUT..MOD)', # going to
        [
            {'TAG': {'IN':['VBZ', 'VBP']}, 'LEMMA':'be'}, 
            advb_pronouns,
            {'TAG': 'VBG', 'LEMMA': 'go'},
            {'TAG': 'TO'},
            {'TAG': 'VB'},
        ]
    ),
    (
        'MOD (PRES..MOD)',
        [
            {'TAG': {'IN':['VB', 'MD']}, 'LEMMA': {'IN':modal_set}},
            {'TAG': {'NOT_IN':['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']}, 'OP': '*'},
            {'TAG': 'VB'},
        ]
    ),
    (
        'PRES PART (PRES..)', 
        [
            {'TAG': 'VBG', 'DEP': {'NOT_IN':['aux']}},
        ]
    ),
    (
        'PRES PART (PRES..)',
        [
            {'TAG': {'IN': ['JJ']}, 'LOWER':{'REGEX':'^.+ing$'}},
        ]
    ),
    (
        'TO INF (..)',
        [
            {'LOWER': 'to'},
            {'TAG':'VB'}
        ]
    ),
    # the imperative in English consists of the base
    # form of a verb; but this form can of course
    # be used within a number of verb constructions
    (
        'IMPV (PRES..IMPV)',
        [
            {'TAG': 'VB', 'DEP':{'NOT_IN':['aux']}},
        ]
    ),
    (
        'IMPV (PRES..IMPV)',
        [
            {'TAG': 'VB', 'LEMMA': {'REGEX':'[dD]o'}},
            {'LEMMA': 'not'},
            {'TAG': 'VB'}
        ]
    ),

]

def test_span_impv(span):
    """Test whether span truly contains an imperative"""
    for token in span:
        if test_impv(token):
            return True
    return False

def check_span(span):
    """Double checks matches and adjusts as needed"""
    if span._.tam_tag.startswith('IMPV'): # rule names are e.g. 'IMPV (PRES..IMPV)'
        if not test_span_impv(span):
            span._.tam_tag = '' # remove the label

matcher = Matcher(nlp.vocab)

# add all rules to the matcher object
for tag, rules in tam_rules:
    matcher.add(tag, None, rules)
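
Because several rules can fire on the same tokens (for example, a bare `VB` rule inside a longer future construction), the overlapping matches are later reduced with `filter_spans`. The following pure-Python sketch illustrates that longest-span-wins idea on plain `(start, end, label)` tuples. It is an illustration of the behavior, not Spacy's own implementation, and the example spans are made up.

```python
def keep_longest(spans):
    """Drop overlapping spans, preferring longer and then earlier ones.

    spans: list of (start, end, label) tuples with an exclusive end.
    """
    # longest first; among equal lengths, the earliest start first
    ranked = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    taken = set()  # token positions already covered by a kept span
    for start, end, label in ranked:
        positions = range(start, end)
        if taken.isdisjoint(positions):
            kept.append((start, end, label))
            taken.update(positions)
    return sorted(kept)  # back in document order


# hypothetical matches over "will have written ... wrote"
matches = [
    (0, 3, 'FUT PERF (FUT.PERF.IND)'),
    (1, 2, 'IMPV (PRES..IMPV)'),
    (5, 6, 'PAST (PAST..IND)'),
]
print(keep_longest(matches))
```

Here the one-token imperative match on "have" is discarded because it lies inside the longer future perfect span.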

<hr>
Testing area.

### TODO: Gen 19:9, tagged as imperative when it's not

Others:
* gen 20:7, "if you do not return her" -- should in general treat "do" different
* lots of futures get inheritance when they shouldn't

In [19]:
nlp2 = spacy.load('en_core_web_sm')

In [20]:
test = nlp2('Did you do that?')

In [21]:
tok = test[0]

tok

Did

In [22]:
tok.lemma_

'do'

In [23]:
def display_tokens(doc):
    for i, token in enumerate(doc):
        print(i, token)
        print('  ', token.lemma_)
        print('  ', token.pos_)
        print('  ', token.tag_)
        print('  ', token.dep_)
        print('  ', token.shape_)
        
#display_tokens(parsed_verses['niv'][('Genesis', 1, 24)])

In [24]:
def test_impv(token, debug=False):
    """Test whether a verb is an imperative."""
    ancestors = list(token.ancestors)
    subtree = list(token.subtree)
    ancest_tags = set(t.tag_ for t in ancestors if t.tag_.startswith('VB'))
    ancest_lemmas = set(t.lemma_ for t in ancestors)
    subtree_deps = set(t.dep_ for t in subtree)
    head_deps = set(t.dep_ for t in token.head.children)
    clause_pos = subtree.index(token)
    modal_set = {
        'let', 'may', 
        'shall', 'must', 'can'
    }
    auxs = {'aux', 'auxpass'}
    
    ancestor_rules = [
        ancest_tags.issubset({'VB'}),
        token.dep_ != 'conj',
    ]
    if ancestors:
        ancestor_rules.append(test_impv(token.head))
    position_rules = [
        clause_pos == 0,
        not ancestors,
    ]
    
    # check for boundary
    rules = [
        token.tag_ == 'VB',
        token.dep_ not in auxs,
        any(ancestor_rules),
        not ancest_lemmas & modal_set,
        not head_deps & auxs,
        any(position_rules),
        token.lemma_ not in modal_set,
    ]
    # specify that word must occur at 
    # a major boundary
    if token.i != 0:
        pre_t = token.doc[token.i-1]
        punct_rules = any([
            pre_t.is_punct,
            pre_t.lemma_ == 'and',
            (pre_t.is_title and pre_t.tag_ == 'RB'),
            pre_t.lemma_ == 'please',
        ])
        rules.append(punct_rules)
    else:
        rules.append(token.is_title)
    
    
    if debug:
        print(rules)
        print('anc rules:', ancestor_rules)
        print('pos rules:', position_rules)
    
    return all(rules)
    

In [25]:
test_parse = parsed_verses['niv'][('1_Samuel', 2, 3)]
# for token in test_parse:
#     is_impv = test_impv(token)
#     print(token, f'\tis impv: {is_impv}')
    
#display_tokens(test_parse)

print(test_parse)
print()
for i,token in enumerate(test_parse):
    print(i, token, end='  ') 

“ Do not keep talking so proudly or let your mouth speak such arrogance , for the LORD is a God who knows , and by him deeds are weighed .

0 “  1 Do  2 not  3 keep  4 talking  5 so  6 proudly  7 or  8 let  9 your  10 mouth  11 speak  12 such  13 arrogance  14 ,  15 for  16 the  17 LORD  18 is  19 a  20 God  21 who  22 knows  23 ,  24 and  25 by  26 him  27 deeds  28 are  29 weighed  30 .  

In [26]:
ex = test_parse[1]

ex

Do

In [27]:
list(ex.children)

[]

In [28]:
ex.lemma_

'do'

In [29]:
ex.tag_

'VBP'

In [30]:
ex.dep_

'aux'

In [31]:
list(ex.ancestors)

[keep]

In [32]:
[t.tag_ for t in ex.ancestors if t.pos_ == 'VERB']

['VB']

In [33]:
list(ex.conjuncts)

[]

In [34]:
list(ex.subtree)

[Do]

In [35]:
ex.head

keep

In [36]:
test_impv(ex, debug=True)

[False, False, True, True, False, True, True, True]
anc rules: [True, True, False]
pos rules: [True, False]


False

In [37]:
spacy.explain('xcomp')

'open clausal complement'

<hr>

In [38]:
def attach_span(spans):
    """Connect a token to its span explicitly."""
    for span in spans:
        for token in span:
            token._.my_span = span

# TODO: consider how to do this recursively
# TODO: clean up fugly code
def endow_tam(spans):
    """Pass down a TAM category from a head to its conjuncts."""
    for span in spans:
        for token in span:
            if token.tag_ != 'VB':
                continue
            conjuncts = token.conjuncts
            for conj in conjuncts:
                 # apply to non-modified verbal stems
                if (cspan := conj._.my_span) and conj.tag_ == 'VB':
                    if len(list(cspan)) == 1:               
                        conj._.my_span._.tam_tag = span._.tam_tag
    
# for every verse isolate the set of relevant spans
# which match the TAM rules and map to verses
verse2spans = collections.defaultdict(dict)
for trans, ref_tuples in parsed_verses.items():
    for ref_tuple, spacy_doc in ref_tuples.items():
        matches = matcher(spacy_doc)
        
        # retrieve Spacy Span objects
        # and give them TAM tags
        spans = []
        for m_id, start, end in matches:
            span = spacy_doc[start:end]
            span._.tam_tag = nlp.vocab.strings[m_id]
            check_span(span)
            spans.append(span)
        
        # filter out overlapping spans and keep 
        # only the longest strings
        filtered_spans = filter_spans(spans)
        #attach_span(filtered_spans)
        #endow_tam(filtered_spans)
        
        # save positive matches; unmatched verses will
        # be recognized later
        if filtered_spans:
            verse2spans[trans][ref_tuple] = filtered_spans
        else:
            continue
            
    print(f'{trans} n-spans: {len(verse2spans[trans])}')

esv n-spans: 21210
niv n-spans: 21184


### Link spans to verbs

We will now attempt to re-link the spans with the verbs.

In [39]:
# test = [str(t) for t in parsed_verses['niv'][('Genesis', 11, 3)]]
# gbi_words = [word_data['niv'][w]['text'] for w in verse2words['niv'][('Genesis', 11, 3)]]
# gbi_ids = verse2words['niv'][('Genesis', 11, 3)]

In [40]:
# for i, token in enumerate(parsed_verses['niv'][('Genesis', 11, 3)]):
#     gid = gbi_ids[a2b_multi.get(i, a2b[i])]
#     #print(token, gid, word_data['niv'][gid]['text'])

In [41]:
def trans_to_span(para_words, spans, 
                  verse_words, aligner):
    """Match given words with their TAM span.
    
    A match is an overlap of known parallel words and 
    a span of matched words, based on overlapping GBI ids.
    Thus all Spacy token indices are converted to GBI indices
    and used to look up the corresponding GBI ids for set comparison.
    
    Args:
        para_words: list of gbi word ids for a known parallel alignment 
        spans: list of Spacy Span objects from the Matcher, with tam_tag attributes
        verse_words: list of gbi ids within a verse; it is indexed with remapped indices
            from the Spans, which have attributes `start` and `end` that
            correspond to token indices in the Spacy doc (verse). Those indices are
            remapped to their GBI positions with the `aligner`.        
        aligner: remaps Spacy indices to GBI indices for tokens
    Returns:
        the first Span that overlaps with para_words, or None if there is none
    """
    for span in spans:
        # span.end is exclusive and can equal len(doc), which the aligner cannot map;
        # so we align the last token (end-1) and add 1 back for slicing
        start, end = aligner(span.start), aligner(span.end-1) 
        end += 1
        span_words = set(verse_words[start:end])
        if set(para_words) & span_words:
            return span

bhsa.indent(0, reset=True)
bhsa.info('matching spans...')
        
verse_inspect = collections.defaultdict(lambda: collections.defaultdict(str)) # for exporting an inspection document
bhsa2eng = collections.defaultdict(dict)

for trans, bhsa_nodes in english_verbs.items():
    
    for bhsa_node, para_words in bhsa_nodes.items():

        inspect = '' # for debugging and inspection
        
        # get GBI-side data
        verse_ref = id2ref(para_words[0], 'translation')        
        para_text = ' '.join(word_data[trans][w]['text'] for w in para_words)
        verse_words = verse2words[trans][verse_ref]
        verse_tokens = [word_data[trans][w]['text'] for w in verse_words]
        
        # get Spacy-side data
        verse_parsing = parsed_verses[trans][verse_ref]
        spacy_tokens = [str(t) for t in verse_parsing]
        
        # map Spacy tokens back to GBI tokens using indices
        # Spacy tokenizes words with apostrophes differently (e.g. `he'll` == `he` + `'ll`)
        # They can be re-aligned: https://spacy.io/usage/linguistic-features#aligning-tokenization
        cost, a2b, b2a, a2b_multi, b2a_multi = align(spacy_tokens, verse_tokens) # alignment of indices here
        aligner = lambda i: a2b_multi.get(i, a2b[i]) # returns 1-to-1 or many-to-1 aligned index
        
        # try to retrieve span links with advanced TAM tags
        spans = verse2spans[trans].get(verse_ref, [])
        span_match = trans_to_span(para_words, spans, verse_words, aligner) or '' # search for overlapping GBI id sets
        if span_match:
            tam_tag = span_match._.tam_tag
        else:
            tam_tag = ''
        
        # retrieve basic parsings
        raw_tokens = []
        for i, token in enumerate(verse_parsing):
            if verse_words[aligner(i)] in para_words:
                raw_tokens.append(token)
                
        vb_tokens = [t for t in raw_tokens if t.tag_.startswith('VB')]
            
        # save the data
        data = {
            'words': para_text,
            'tags': '|'.join(t.tag_ for t in raw_tokens),
            'vb_tags': '|'.join(t.tag_ for t in vb_tokens),
            'TAM_cx': tam_tag,
            'TAM_span': f'{span_match}',
        }
        
        bhsa2eng[trans][bhsa_node] = data
            
        # add strings to inspection file
        if span_match and span_match._.tam_tag:
            verse_inspect[trans][verse_ref] += f'\tMATCH: {para_text}\n'
            verse_inspect[trans][verse_ref] += f'\t\t{span_match} -> {span_match._.tam_tag}\n'
        else:
            verse_inspect[trans][verse_ref] += f'\tMISS: {para_text}\n'
            
bhsa.info('done with matches')

  0.00s matching spans...
    21s done with matches
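
The re-linking step rests on two ideas: mapping the parser's token indices onto the GBI word indices, and then testing whether a span and a linked word set share any GBI ids. The sketch below shows both on made-up data without Spacy. `align_by_chars` is a hypothetical stand-in for the `align` call above and only handles the case where the parser splits a GBI word into several tokens; the token lists and ids are invented.

```python
def align_by_chars(fine_tokens, coarse_tokens):
    """Map each fine token index to the coarse token that contains it.

    Assumes both lists spell out the same text and that fine tokens
    never cross a coarse token boundary.
    """
    mapping = []
    coarse_i = 0
    consumed = 0  # characters of the current coarse token covered so far
    for tok in fine_tokens:
        mapping.append(coarse_i)
        consumed += len(tok)
        if consumed >= len(coarse_tokens[coarse_i]):
            coarse_i += 1
            consumed = 0
    return mapping


def span_overlaps(start, end, mapping, verse_ids, target_ids):
    """Check whether the fine span [start, end) covers any target id."""
    first, last = mapping[start], mapping[end - 1]
    span_ids = set(verse_ids[first:last + 1])
    return bool(span_ids & set(target_ids))


parser_tokens = ['Come', ',', 'let', '’s', 'make', 'bricks']
gbi_tokens = ['Come', ',', 'let’s', 'make', 'bricks']
gbi_ids = [501, 502, 503, 504, 505]  # hypothetical word ids

mapping = align_by_chars(parser_tokens, gbi_tokens)
print(mapping)
print(span_overlaps(2, 5, mapping, gbi_ids, target_ids=(504,)))
print(span_overlaps(2, 5, mapping, gbi_ids, target_ids=(505,)))
```

The span over `let ’s make` covers the word with id `504` but not `505`, so only the first lookup would link the verb to that span.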


In [42]:
# export
for trans, trans_data  in bhsa2eng.items():
    trans_file = VERB_DIR.joinpath(f'bhsa2{trans}.json')
    with open(trans_file, 'w') as outfile:
        json.dump(trans_data, outfile, ensure_ascii=False, indent=2)
    
    # export inspection file
    write = ''
    inspect_file = PRIVATE_DATA.joinpath(f'debugging/{trans}_inspect.txt')
    for verse, message in verse_inspect[trans].items():
        write += '{} {}:{}'.format(*verse) + '\n'
        write += str(parsed_verses[trans][verse]) + '\n'
        write += message
        write += '\n'
    inspect_file.write_text(write)

In [43]:
# dump the bhsa2wlc dataset
wlc_verbdataset_path = VERB_DIR.joinpath('bhsa2wlc.json') 
with open(wlc_verbdataset_path, 'w') as outfile:
    json.dump(verb_bhsa2gbi, outfile, ensure_ascii=False, indent=2)
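
One thing to keep in mind when reading these files back: JSON object keys are always strings, so the integer BHSA node numbers used as keys come back as `str`. A minimal round-trip sketch, using a made-up node number `1001` as the key:

```python
import json

# hypothetical node number as key; the value mimics a WLC word record
verb_map = {1001: {'id': 10010010021, 'text': 'ברא', 'pos': 'verb'}}

# ensure_ascii=False keeps the Hebrew readable in the output
dumped = json.dumps(verb_map, ensure_ascii=False, indent=2)
loaded = json.loads(dumped)

# keys are strings after loading, so convert them back to integers
restored = {int(node): data for node, data in loaded.items()}

print(list(loaded), list(restored))
```

Converting the keys once on load avoids silent lookup misses when the data is later indexed by node number.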

<hr>

# Export

In [44]:
# # save these resulting dicts for later convenient use

# save_files = [
#     ('word_data', word_data),
#     ('verse2words', {k:{tuple(k2):v2 for k2,v2 in v.items()} for k,v in verse2words.items()}),
#     ('id_links', linkbyid),
# ]

# for filename, data in save_files:
#     filepath = GBI_DATA_DIR.joinpath(filename+'.json')
#     with open(filepath, 'w') as outfile:
#         json.dump(data, outfile, ensure_ascii=False, indent=2)

In [45]:
# # export the dataset
# dataset_path = PROJ_DIR.joinpath('data/_private_/translation_dataset.csv')

# data_df.to_csv(dataset_path, index=False)

<hr> 
Scratch code

In [46]:
# write = ''
# for verse, cases in verse_sees.items():
#     parsed_verse = parsed_verses['niv'][verse]
#     ref_str = '{} {}:{}'.format(*verse)
#     write += ref_str + '\n'
#     write += str(parsed_verse) + '\n'
#     for case in cases:
#         write += case
#     write += '\n'
    
# Path('see_cases.txt').write_text(write)