# Build Verb Translation Dataset

In this notebook, we build a translation dataset based on NIV and ESV translation
alignments provided by GBI. The GBI data, which uses an underlying WLC Hebrew text,
has already been aligned to the Amsterdam BHSA Hebrew dataset in
[GBI_alignment_wrangling.ipynb](GBI_alignment_wrangling.ipynb). We can thus take
advantage of both databases and their associated data when building our dataset here.

In the dataset we'll attempt to parse the English text so that the syntax and (especially)
the verbal forms can be analyzed alongside the Hebrew grammar. We'll start out with 
Spacy for the English parsings. 

In [2]:
import re
import json
import collections
import pandas as pd
from pathlib import Path
from tf.fabric import Fabric
from tf.app import use

import spacy
from spacy.tokens import Doc

# custom modules 
import tf_tools
from gbi_functions import id2ref
from positions import PositionsTF

# organize pathways
PROJ_DIR = Path.home().joinpath('github/CambridgeSemiticsLab/translation_traditions_HB')
GBI_DATA_DIR = PROJ_DIR.joinpath('data/_private_/GBI_alignment')

# load GBI data
gbi_niv = json.loads(GBI_DATA_DIR.joinpath('niv84.ot.alignment.json').read_text())
#gbi_esv = json.loads(GBI_DATA_DIR.joinpath('esv.ot.alignment.json').read_text())

# load BHSA / GBI Alignment
bhsa2gbi = json.loads(GBI_DATA_DIR.joinpath('bhsa2gbi.json').read_text())

In [3]:
# load BHSA features with genre module
locations = [
    '~/github/etcbc/bhsa/tf/c', 
    '~/github/etcbc/genre_synvar/tf/c',
    '~/github/etcbc/valence/tf/c'
]
TF = Fabric(locations)
extra_features = '''
domain txt ps gn 
nu genre sense
mother
'''
features = tf_tools.standard_features + extra_features
api = TF.load(features)
bhsa = use('bhsa', api=api)
F, E, T, L, Fs, = bhsa.api.F, bhsa.api.E, bhsa.api.T, bhsa.api.L, bhsa.api.Fs

from clause_relas import in_dep_calc as clause_relator

This is Text-Fabric 8.4.0
Api reference : https://annotation.github.io/text-fabric/cheatsheet.html

125 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  5.84s All features loaded/computed - for details use loadLog()


In [5]:
# set up some dictionaries for convenient word data access

gbi_words = collections.defaultdict(dict)
wlc = {}
verse2words = collections.defaultdict(lambda: collections.defaultdict(list))
linkbyid = collections.defaultdict(list)
sources = (('niv', gbi_niv),) #('esv', gbi_esv))

for name, source in sources:
    for verse in source:
        # unpack words for processing
        trans_words =  verse['translation']['words']
        manu_words = verse['manuscript']['words']
        
        # map translation word data
        for w in trans_words:
            ref_tuple = id2ref(w['id'], 'translation')
            verse2words[name][ref_tuple].append(w)
            gbi_words[name][w['id']] = w
        
        # map WLC word data
        # arbitrarily use the copy stored under NIV
        if name == 'niv':
            for w in manu_words:
                verse2words['wlc'][id2ref(w['id'])].append(w)
                wlc[w['id']] = w
                
        # map links to word ids
        for wlc_indices, trans_indices in verse['links']:
            wlc_ids = tuple(manu_words[i]['id'] for i in wlc_indices)
            trans_ids = tuple(sorted(trans_words[i]['id'] for i in trans_indices))
            linkbyid[name].append((wlc_ids, trans_ids))

In [6]:
# view a sampling
for wlc_ids, trn_ids in linkbyid['niv'][:40]:
    wlc_text = ' '.join(wlc[w]['text'].replace('\u200e', '') for w in wlc_ids)
    trn_text = ' '.join(gbi_words['niv'][w]['text'] for w in trn_ids)
    print(wlc_text, '->', trn_text)

בְּ -> In
רֵאשִׁ֖ית -> beginning
בָּרָ֣א -> created
אֱלֹהִ֑ים -> God
הַ -> the
שָּׁמַ֖יִם -> heavens
וְ -> and
הָ -> the
אָֽרֶץ -> earth
וְ -> Now
הָ -> the
אָ֗רֶץ -> earth
הָיְתָ֥ה -> was
תֹ֙הוּ֙ -> formless
וָ -> and
בֹ֔הוּ -> empty
חֹ֖שֶׁךְ -> darkness
עַל־ -> over
פְּנֵ֣י -> surface of
תְה֑וֹם -> deep
וְ -> and
ר֣וּחַ -> Spirit of
אֱלֹהִ֔ים -> God
מְרַחֶ֖פֶת -> was hovering
עַל־ פְּנֵ֥י -> over
הַ -> the
מָּֽיִם -> waters
וַ -> And
יֹּ֥אמֶר -> said
אֱלֹהִ֖ים -> God
יְהִ֣י -> Let there be
א֑וֹר -> light
וַֽ -> and
יְהִי־ -> there was
אֽוֹר -> light
יַּ֧רְא -> saw
אֱלֹהִ֛ים -> God
הָ -> the
א֖וֹר -> light
כִּי־ -> that


## Select Verbs for the Dataset

Now we must search and select the verbs wanted for the analysis dataset. We select
the desired verbs through a few filters. The filters are:

1. agreement between BHSA and WLC on the classification of a word as a verb (e.g. טוֹב); for BHSA classification, 
    we use a contextual definition (phrase-dependent) which classifies whether the word is behaving as a verb in context (e.g. participles)
2. verb tense is selected for inclusion in the dataset; for instance, if we only select 'perf' (qatal), we filter
    out all non-perfect verbs.

Of course, we also skip any words that are not verbs in either dataset.

Since some alignments between BHSA and WLC are `many-to-N` or `N-to-many`,
we also filter the non-verbal words out of each such link. This leaves us with
a 1-to-1 alignment, so that 1 verb in BHSA equals 1 verb in WLC.

For each 1-to-1 alignment between BHSA and WLC that meets these requirements,
we will construct a row of data which will be entered into a DataFrame (table)
for statistical analysis.

In [17]:
def has_preceding_waw(bhsa_verb):
    """Check whether verb has preceding waw in a clause context."""
    context = PositionsTF(bhsa_verb, 'clause', api)
    prev_word = context.get(-1) or 0
    return F.lex.v(prev_word) == 'W'
        
def build_row(bhsa_verb, wlc_verb):
    """Retrieve variables for a supplied 1-to-1 matched verb pair."""
    
    # grab BHSA data/objects
    ref_tuple = T.sectionFromNode(bhsa_verb)
    ref_string = '{} {}:{}'.format(*ref_tuple)
    verse_node = L.u(bhsa_verb, 'verse')[0]
    clause_atom = L.u(bhsa_verb, 'clause_atom')[0]
    clause = L.u(bhsa_verb, 'clause')[0]
    sent = L.u(bhsa_verb, 'sentence')[0]
    clause_type = F.typ.v(clause)
    
    # grab WLC data    
    wlc_gloss = wlc[wlc_verb]['gloss']
    wlc_parse = wlc[wlc_verb]['morph']
        
    return {
        'bhsa_node': bhsa_verb,
        'wlc_id': wlc_verb,
        'ref': ref_string, 
        'book': ref_tuple[0], 
        'text_full': F.g_word_utf8.v(bhsa_verb),
        'text_plain': F.g_cons_utf8.v(bhsa_verb),
        'lex': F.lex_utf8.v(bhsa_verb),
        'lex_etcbc': F.lex.v(bhsa_verb),
        'gloss': wlc_gloss,
        'tense': F.vt.v(bhsa_verb),
        'stem': F.vs.v(bhsa_verb),
        'person': F.ps.v(bhsa_verb),
        'gender': F.gn.v(bhsa_verb),
        'number': F.nu.v(bhsa_verb),
        'wlc_morph': wlc_parse,
        'sentence': T.text(sent),
        'genre': F.genre.v(verse_node),
        'domain': F.domain.v(clause),
        'txt_type': F.txt.v(clause),
        'clause_type': clause_type,
        'clause_rela': clause_relator(clause),
        'preceding_waw': has_preceding_waw(bhsa_verb),
        'valence': F.sense.v(bhsa_verb),
    }
    
# here we put the verb tenses that we 
# want to select in the big loop;
# these are the verb types that will go into the final dataset
select = {'perf'}

# the big loop maps 1 BHSA verb to 1 WLC verb
# and stores one row of data per pair in this list
verb_data = []

# instances where BHSA and WLC disagree on the 
# classification of a verb
verb_disagree = []

# verbs that are not selected go here simply 
# to build a count of how many valid forms are found 
other_verbs = []

# find 1-to-1 matches of BHSA and WLC verbs
for bhsa_nodes, gbi_ids in bhsa2gbi:
    
    # filter out non-verbs from the links
    bhsa_verbs = [w for w in bhsa_nodes if F.pdp.v(w) == 'verb'] 
    wlc_verbs = [w for w in gbi_ids if wlc[w]['pos'] == 'verb']
    data = (T.text(bhsa_nodes), T.sectionFromNode(bhsa_nodes[0]), bhsa_nodes, gbi_ids)
    
    # one case, Jer 51:3, has a double verb mapping caused by 
    # ידרך ידרך, which BHSA maps to a single word node, and gbi 
    # keeps as 2 words; we disambig that here and keep only 
    # first gbi word
    if bhsa_verbs and bhsa_verbs[0] == 262780:
        wlc_verbs = wlc_verbs[:1]
    
    # skip non-verbal contexts
    if not bhsa_verbs + wlc_verbs:
        continue
    
    # track disagreements between 2 sources
    elif (bhsa_verbs and not wlc_verbs) or (wlc_verbs and not bhsa_verbs):
        verb_disagree.append(data)
    
    # keep clean 1-to-1 matches: exactly one verb in each source
    elif len(bhsa_verbs) == 1 and len(wlc_verbs) == 1:
        
        # make a subset selection of verbs
        bhsa_verb, wlc_verb = bhsa_verbs[0], wlc_verbs[0]
        parse = F.vt.v(bhsa_verb)
        if parse in select:
            verb_data.append(build_row(bhsa_verb, wlc_verb))
        else:
            other_verbs.append([bhsa_verb, wlc_verb])
    
    # or there's a problem...
    else:
        raise Exception(f'Misalignment at {data}')
        
        
print(len(verb_disagree), 'verbs excluded due to pos disagreement')
print(len(verb_data)+len(other_verbs), 'verbs agree')
print(len(verb_data), 'selected for building dataset')

4427 verbs excluded due to pos disagreement
68826 verbs agree
21082 selected for building dataset
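
The branching inside the big loop can be isolated as a small pure function, which makes
the four possible outcomes easier to see. This is only a sketch on toy data: the function
name `classify_link` and the labels it returns are hypothetical, and the node and ID
values are made up for illustration.

```python
def classify_link(bhsa_verbs, wlc_verbs):
    """Classify one aligned link by the verbs each source reports.

    Returns 'skip', 'disagree', 'pair', or 'misaligned' (hypothetical labels).
    """
    if not bhsa_verbs and not wlc_verbs:
        return 'skip'        # no verb on either side
    if not bhsa_verbs or not wlc_verbs:
        return 'disagree'    # only one source classifies a word as a verb
    if len(bhsa_verbs) == 1 and len(wlc_verbs) == 1:
        return 'pair'        # clean 1-to-1 verb match
    return 'misaligned'      # more than one verb on some side

# toy links: (BHSA verb nodes, WLC verb ids); all values are illustrative
toy_links = [
    ([], []),
    ([11], [901]),
    ([12], []),
    ([13, 14], [902]),
]

labels = [classify_link(b, w) for b, w in toy_links]
print(labels)
```

In the loop above, the last case raises an exception rather than returning a label.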


Sample the dataset:

In [12]:
verb_data[0]

{'bhsa_node': 3,
 'wlc_id': 10010010021,
 'ref': 'Genesis 1:1',
 'book': 'Genesis',
 'text_full': 'בָּרָ֣א',
 'text_plain': 'ברא',
 'lex': 'ברא',
 'lex_etcbc': 'BR>[',
 'gloss': 'he created',
 'tense': 'perf',
 'stem': 'qal',
 'person': 'p3',
 'gender': 'm',
 'number': 'sg',
 'wlc_morph': 'vqp3ms',
 'sentence': 'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ ',
 'genre': 'prose',
 'domain': '?',
 'txt_type': '?',
 'clause_type': 'xQtX',
 'clause_rela': 'Main',
 'preceding_waw': False,
 'valence': 'd-'}

NB: the ~4.4k verbs in disagreement arise because we use contextual parts of speech
from the BHSA dataset. The GBI dataset does not seem to be as sensitive to context for
pos. A large proportion of these cases are participles used as nouns rather than verbs.

In [9]:
verb_disagree[500:510]

[('פְּקֻדֵיהֶ֖ם ', ('Numbers', 1, 39), [70153], [40010390011, 40010390012]),
 ('יֹצֵ֥א ', ('Numbers', 1, 40), [70183], [40010400141]),
 ('פְּקֻדֵיהֶ֖ם ', ('Numbers', 1, 41), [70185], [40010410011, 40010410012]),
 ('יֹצֵ֥א ', ('Numbers', 1, 42), [70214], [40010420141]),
 ('פְּקֻדֵיהֶ֖ם ', ('Numbers', 1, 43), [70216], [40010430011, 40010430012]),
 ('פְּקֻדִ֡ים ', ('Numbers', 1, 44), [70229], [40010440022]),
 ('פְּקוּדֵ֥י ', ('Numbers', 1, 45), [70250], [40010450031]),
 ('פְּקֻדִ֔ים ', ('Numbers', 1, 46), [70271], [40010460032]),
 ('פְקֻדֵיהֶ֑ם ', ('Numbers', 2, 4), [70477], [40020040022, 40020040023]),
 ('פְקֻדָ֑יו ', ('Numbers', 2, 6), [70502], [40020060022, 40020060023])]

### Cluster English text into contextualized chunks for parsing

Translations are currently aligned to each word. We'll need to be 
able to contextualize translated verbs within their sentences in order
to run the Spacy dependency parser on the full sentence. Here we build
the contexts that the parser will parse.

The translation data stores punctuators as separate parts, e.g.

```
{'id': 1001001012,
 'altId': '.-1',
 'text': '.',
 'transType': '',
 'isPunc': True,
 'isPrimary': False}
```

We will first select verses, then split on punctuators.

Procedure:

1. For each verb, get the verse in English translation, words in verse, and split verse by punctuators.
2. Map strings to string objects so they can be tracked while being manipulated like normal strings.
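
The splitting step might look like the sketch below. It assumes each translation word
is a dict with `text` and `isPunc` keys, as in the record shown above. The function name
`split_on_punctuation` is hypothetical and the toy verse is made up. Each chunk keeps the
full word records rather than bare strings, so the word IDs remain available for tracking.

```python
def split_on_punctuation(words):
    """Split a verse's word records into chunks at punctuation tokens."""
    chunks, current = [], []
    for word in words:
        if word['isPunc']:
            # a punctuator closes the current chunk (if any) and is dropped
            if current:
                chunks.append(current)
                current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return chunks

# toy verse; ids and texts are illustrative only
toy_verse = [
    {'id': 1, 'text': 'God', 'isPunc': False},
    {'id': 2, 'text': 'said', 'isPunc': False},
    {'id': 3, 'text': ',', 'isPunc': True},
    {'id': 4, 'text': '"', 'isPunc': True},
    {'id': 5, 'text': 'Let', 'isPunc': False},
    {'id': 6, 'text': 'there', 'isPunc': False},
    {'id': 7, 'text': 'be', 'isPunc': False},
    {'id': 8, 'text': 'light', 'isPunc': False},
    {'id': 9, 'text': '.', 'isPunc': True},
]

chunks = split_on_punctuation(toy_verse)
chunk_texts = [' '.join(w['text'] for w in chunk) for chunk in chunks]
chunk_ids = [[w['id'] for w in chunk] for chunk in chunks]
print(chunk_texts)
```

Note that this splits on every punctuator, commas included. Whether that is too
fine-grained for the dependency parser is an open question.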

In [None]:
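class ContextualVerb:
    """A verb pair with its English verse context (unfinished stub)."""

    def __init__(self, bhsa2gbi):
        # unpack the matched BHSA node and GBI word id
        self.bhsa_verb, self.gbi_verb = bhsa2gbi

    def index_verse_context(self):
        # not yet implemented
        raise NotImplementedError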

In [163]:
wlc[10070130171]

{'id': 10070130171,
 'altId': 'אֶל־\u200e-1',
 'text': 'אֶל־\u200e',
 'strong': 'H0413',
 'gloss': 'into',
 'gloss2': '入',
 'lemma': 'אֶל',
 'pos': 'prep',
 'morph': 'Pp'}

In [164]:
id2ref(10070130171)

('Genesis', 7, 13)
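
The manuscript IDs seen in this notebook appear to pack the reference into fixed-width
digit fields. The sketch below decodes them under that assumption. The layout (2-digit
book, 3-digit chapter, 3-digit verse, 3-digit word, 1-digit part) is inferred only from
the examples shown here. `decode_manuscript_id` is a hypothetical helper, not the actual
`id2ref` from `gbi_functions`, and it does not handle translation IDs, which have a
different length.

```python
def decode_manuscript_id(word_id):
    """Split a manuscript word ID into numeric fields (assumed layout)."""
    digits = str(word_id).zfill(12)  # restore the book number's leading zero
    return {
        'book': int(digits[0:2]),
        'chapter': int(digits[2:5]),
        'verse': int(digits[5:8]),
        'word': int(digits[8:11]),
        'part': int(digits[11]),
    }

genesis_word = decode_manuscript_id(10070130171)
numbers_word = decode_manuscript_id(40010390011)
print(genesis_word)
print(numbers_word)
```

Under this reading, the first ID resolves to book 1, chapter 7, verse 13, which agrees
with the `('Genesis', 7, 13)` returned by `id2ref` above.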

## Parsing English with Spacy

For understanding the basics of Spacy, see:
https://spacy.io/usage/linguistic-features

For each verb in the `verb_data` list, we retrieve its verse text in a
given translation. The translated text is parsed by Spacy, which supplies us with
a dependency tree, parts of speech, and verb tenses for the English side of things.

For Spacy tags used with the model of choice, see:
https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-2.3.1

In [14]:
pd.set_option('display.max_rows', 200)
nlp = spacy.load('en_core_web_sm')

In [26]:
def get_links(gbi_id, source):
    """Retrieve linked pairs of IDs for GBI data"""
    for heb, eng in linkbyid[source]:
        if gbi_id in heb:
            return (heb, eng)

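def process_tokens(gbi_tokens):
    """Build a Spacy Doc from pre-tokenized GBI words, keeping their IDs."""
    tokens = [t['text'] for t in gbi_tokens]
    # map each token position in the Doc back to its GBI word id
    user_data = {i:t['id'] for i,t in enumerate(gbi_tokens)}
    doc = Doc(nlp.vocab, words=tokens, user_data=user_data)
    return doc

# replace the default tokenizer so that nlp() accepts a list of GBI word records
nlp.tokenizer = process_tokens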

# build experimental dataset

data = []
missed_verbs = []

bhsa.indent(0, reset=True)
bhsa.info('begin')
bhsa.indent(1, reset=True)
for i, vd in enumerate(verb_data):
    
    bverb, wverb = vd['bhsa_node'], vd['wlc_id']
    
    try:
        linked_heb, linked_eng = get_links(wverb, 'niv')
    except TypeError:
        # get_links returns None when no link contains the verb
        missed_verbs.append((bverb, wverb))
        continue
        
    ref_word = gbi_words['niv'][linked_eng[0]] # get arbitrary word for referencing
    eng_ref = id2ref(ref_word['id'], 'translation') # get english verse ref tuple
    eng_verse_words = verse2words['niv'][eng_ref] # get English words from the verse
    spacy_doc = nlp(eng_verse_words) # parse English
    imap = spacy_doc.user_data
    
    # filter out those parsed Eng. words to keep 
    of_interest = [] 
    for w in spacy_doc:
        conds = [
            imap[w.i] in linked_eng,
            w.tag_ not in {'PRP'}
        ]
        if all(conds):
            of_interest.append(w)
    
    vd['niv_tags'] = '|'.join(w.tag_ for w in of_interest) 
    vd['niv_dep'] = '|'.join(w.dep_ for w in of_interest)
    vd['niv_words'] = ' '.join(str(w) for w in of_interest)

    if i % 5000 == 0 and i != 0:
        bhsa.info(f'done with iteration {i}')
     
bhsa.indent(0)
bhsa.info('done!')

  0.00s begin
   |    1m 11s done with iteration 5000
   |    3m 03s done with iteration 10000
   |    5m 41s done with iteration 15000
   |    8m 03s done with iteration 20000
 8m 30s done!
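
`get_links` scans the whole link list once per verb, which may account for part of the
run time above (the parsing itself is also costly). One possible alternative is to index
the links by Hebrew word ID once, before the loop. The sketch below uses toy link tuples
in the same `(hebrew_ids, english_ids)` shape as `linkbyid`. The name `index_links` is
hypothetical and the IDs are made up.

```python
def index_links(links):
    """Map each Hebrew word id to the first link that contains it."""
    index = {}
    for heb_ids, eng_ids in links:
        for heb_id in heb_ids:
            # setdefault keeps the first link, like the linear scan does
            index.setdefault(heb_id, (heb_ids, eng_ids))
    return index

# toy links in the same shape as linkbyid['niv']; values are illustrative
toy_links = [
    ((101,), (201,)),
    ((102, 103), (202, 203)),
    ((103,), (204,)),
]

link_index = index_links(toy_links)
found = link_index.get(103)    # first link containing id 103
missing = link_index.get(999)  # None when the id is in no link
print(found, missing)
```

With such an index, a missing verb shows up as `None` from `.get()` and can be
checked for directly.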


In [27]:
data_df = pd.DataFrame(verb_data)

data_df.head()

Unnamed: 0,bhsa_node,wlc_id,ref,book,text_full,text_plain,lex,lex_etcbc,gloss,tense,...,genre,domain,txt_type,clause_type,clause_rela,preceding_waw,valence,niv_tags,niv_dep,niv_words
0,3,10010010021,Genesis 1:1,Genesis,בָּרָ֣א,ברא,ברא,BR>[,he created,perf,...,prose,?,?,xQtX,Main,False,d-,VBD,ROOT,created
1,15,10010020021,Genesis 1:2,Genesis,הָיְתָ֥ה,היתה,היה,HJH[,she was,perf,...,prose,?,?,WXQt,Main,False,--,VBD,ROOT,was
2,69,10010050061,Genesis 1:5,Genesis,קָ֣רָא,קרא,קרא,QR>[,he called,perf,...,prose,N,?N,WxQ0,Main,False,l.,VBD,relcl,called
3,172,10010100071,Genesis 1:10,Genesis,קָרָ֣א,קרא,קרא,QR>[,he called,perf,...,prose,N,?N,WxQ0,Main,False,l.,VBD,relcl,called
4,255,10010140122,Genesis 1:14,Genesis,הָי֤וּ,היו,היה,HJH[,let them be,perf,...,prose,Q,?NQ,WQt0,Main,True,--,,,


In [28]:
data_df.head(25)

Unnamed: 0,bhsa_node,wlc_id,ref,book,text_full,text_plain,lex,lex_etcbc,gloss,tense,...,genre,domain,txt_type,clause_type,clause_rela,preceding_waw,valence,niv_tags,niv_dep,niv_words
0,3,10010010021,Genesis 1:1,Genesis,בָּרָ֣א,ברא,ברא,BR>[,he created,perf,...,prose,?,?,xQtX,Main,False,d-,VBD,ROOT,created
1,15,10010020021,Genesis 1:2,Genesis,הָיְתָ֥ה,היתה,היה,HJH[,she was,perf,...,prose,?,?,WXQt,Main,False,--,VBD,ROOT,was
2,69,10010050061,Genesis 1:5,Genesis,קָ֣רָא,קרא,קרא,QR>[,he called,perf,...,prose,N,?N,WxQ0,Main,False,l.,VBD,relcl,called
3,172,10010100071,Genesis 1:10,Genesis,קָרָ֣א,קרא,קרא,QR>[,he called,perf,...,prose,N,?N,WxQ0,Main,False,l.,VBD,relcl,called
4,255,10010140122,Genesis 1:14,Genesis,הָי֤וּ,היו,היה,HJH[,let them be,perf,...,prose,Q,?NQ,WQt0,Main,True,--,,,
5,267,10010150012,Genesis 1:15,Genesis,הָי֤וּ,היו,היה,HJH[,let them be,perf,...,prose,Q,?NQ,WQt0,Main,True,-p,VB|VB,ROOT|ccomp,let be
6,397,10010210121,Genesis 1:21,Genesis,שָׁרְצ֨וּ,שׁרצו,שׁרץ,CRY[,they teem,perf,...,prose,N,?N,xQtX,SubMod,False,--,NNS,relcl,teems
7,545,10010270081,Genesis 1:27,Genesis,בָּרָ֣א,ברא,ברא,BR>[,he created,perf,...,prose,N,?N,xQt0,Main,False,d-,VBD,ccomp,created
8,550,10010270121,Genesis 1:27,Genesis,בָּרָ֥א,ברא,ברא,BR>[,he created,perf,...,prose,N,?N,xQt0,Main,False,n.,VBD,ROOT,created
9,594,10010290041,Genesis 1:29,Genesis,נָתַ֨תִּי,נתתי,נתן,NTN[,I give,perf,...,prose,Q,?NQ,xQt0,Main,False,di,VBP,ccomp,give
