# Build Verb Translation Dataset

In this notebook, we build a translation dataset based on the NIV and ESV translation
alignments provided by GBI. The GBI data, which uses an underlying WLC Hebrew text,
has already been aligned to the Amsterdam BHSA Hebrew dataset in 
[GBI_alignment_wrangling.ipynb](GBI_alignment_wrangling.ipynb). We can thus take
advantage of both databases and their associated data when building our dataset here.

In the dataset, we'll attempt to parse the English text so that the syntax and (especially)
the verbal forms can be analyzed alongside the Hebrew grammar. We'll start with 
spaCy for the English parses. 

In [1]:
import re
import json
import collections
import pandas as pd
from pathlib import Path
from tf.app import use
from gbi_functions import id2ref

# organize pathways
PROJ_DIR = Path.home().joinpath('github/CambridgeSemiticsLab/translation_traditions_HB')
GBI_DATA_DIR = PROJ_DIR.joinpath('data/_private_/GBI_alignment')

# load BHSA data
bhsa = use('bhsa')
api = bhsa.api
F, E, T, L, Fs = api.F, api.E, api.T, api.L, api.Fs

# load GBI data
gbi_niv = json.loads(GBI_DATA_DIR.joinpath('niv84.ot.alignment.json').read_text())
gbi_esv = json.loads(GBI_DATA_DIR.joinpath('esv.ot.alignment.json').read_text())

# load BHSA / GBI Alignment
bhsa2gbi = json.loads(GBI_DATA_DIR.joinpath('bhsa2gbi.json').read_text())

In [2]:
# set up some dictionaries for convenient word data access

gbi_words = collections.defaultdict(dict)
wlc = {}
verse2words = collections.defaultdict(lambda: collections.defaultdict(list))
linkbyid = collections.defaultdict(list)
sources = (('niv', gbi_niv), ('esv', gbi_esv))

for name, source in sources:
    for verse in source:
        # unpack words for processing
        trans_words = verse['translation']['words']
        manu_words = verse['manuscript']['words']
        
        # map translation word data
        for w in trans_words:
            ref_tuple = id2ref(w['id'], 'translation')
            verse2words[name][ref_tuple].append(w)
            gbi_words[name][w['id']] = w
        
        # map WLC word data
        # arbitrarily use the copy stored under NIV
        if name == 'niv':
            for w in manu_words:
                verse2words['wlc'][id2ref(w['id'])].append(w)
                wlc[w['id']] = w
                
        # map links to word ids
        for wlc_indices, trans_indices in verse['links']:
            wlc_ids = tuple(manu_words[i]['id'] for i in wlc_indices)
            trans_ids = tuple(sorted(trans_words[i]['id'] for i in trans_indices))
            linkbyid[name].append((wlc_ids, trans_ids))

In [4]:
# view a sampling
for wlc_ids, trn_ids in linkbyid['esv'][:40]:
    wlc_text = ' '.join(wlc[w]['text'].replace('\u200e', '') for w in wlc_ids)
    trn_text = ' '.join(gbi_words['esv'][w]['text'] for w in trn_ids)
    print(wlc_text, '->', trn_text)

בְּ -> In
רֵאשִׁ֖ית -> beginning
בָּרָ֣א -> created
אֱלֹהִ֑ים -> God
הַ -> the
שָּׁמַ֖יִם -> heavens
וְ -> and
הָ -> the
אָֽרֶץ -> earth
הָ -> The
אָ֗רֶץ -> earth
הָיְתָ֥ה -> was
תֹ֙הוּ֙ -> without form
וָ -> and
בֹ֔הוּ -> void
וְ -> and
חֹ֖שֶׁךְ -> darkness
עַל־ -> over
פְּנֵ֣י -> face of
תְה֑וֹם -> deep
וְ -> And
ר֣וּחַ -> Spirit of
אֱלֹהִ֔ים -> God
מְרַחֶ֖פֶת -> was hovering
עַל־ -> over
פְּנֵ֥י -> face of
הַ -> the
מָּֽיִם -> waters
וַ -> And
יֹּ֥אמֶר -> said
אֱלֹהִ֖ים -> God
יְהִ֣י -> Let there be
א֑וֹר -> light
וַֽ -> and
יְהִי־ -> there was
אֽוֹר -> light
וַ -> And
יַּ֧רְא -> saw
אֱלֹהִ֛ים -> God
הָ -> the


## Verb forms dataset

Target verb forms for initial dataset:

* qatal/w+qatal
* yiqtol
* wayyiqtol

### Pre-heat GBI/BHSA verbs

As a first step, we reduce the GBI/BHSA alignments, which can be many-to-many,
to one-to-one verb pairings, so that each BHSA verb maps to exactly one GBI verb. 
Many of the 1-to-2 alignments arise because a verb carries a pronominal
suffix. In such cases, we filter out the suffix and keep only 
the verb.

In [5]:
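bhsa2gbi_verbs = {}  # unused below; results are stored in verb_map
verb_disagree = []   # contexts where only one of the two sources tags a verb
verb_plural = []     # never filled: multi-verb contexts raise an exception instead
verb_map = {}        # BHSA verb node -> GBI (WLC) verb id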

for bhsa_nodes, gbi_ids in bhsa2gbi:
    
    # filter out only verbs
    bhsa_verbs = [w for w in bhsa_nodes if F.pdp.v(w) == 'verb'] 
    wlc_verbs = [w for w in gbi_ids if wlc[w]['pos'] == 'verb']
    data = (T.text(bhsa_nodes), T.sectionFromNode(bhsa_nodes[0]), bhsa_nodes, gbi_ids)
    
    # one case, Jer 51:3, has a double verb mapping caused by 
    # ידרך ידרך, which BHSA maps to a single word node, and gbi 
    # keeps as 2 words; we disambig that here and keep only 
    # first gbi word
    if bhsa_verbs and bhsa_verbs[0] == 262780:
        wlc_verbs = wlc_verbs[:1]
    
    # skip non-verbal contexts
    if not bhsa_verbs + wlc_verbs:
        continue
    
    # track disagreements between 2 sources
    elif (bhsa_verbs and not wlc_verbs) or (wlc_verbs and not bhsa_verbs):
        verb_disagree.append(data)
    
    # store result
    elif len(bhsa_verbs) == 1 and len(wlc_verbs) == 1:
        verb_map[bhsa_verbs[0]] = wlc_verbs[0]
    
    # or there's a problem...
    else:
        raise Exception(f'Misalignment at {data}')
        
        
print(len(verb_disagree), 'verbs excluded due to pos disagreement')
print(len(verb_plural), 'verbal contexts have more than one verb')
print(len(verb_map), 'verbs agree')

4427 verbs excluded due to pos disagreement
0 verbal contexts have more than one verb
68826 verbs agree


Note that the ~4.4k disagreements arise because we use contextual parts of speech
from the BHSA dataset, whereas the GBI dataset does not seem to be as sensitive to context
for pos. A large proportion of these cases are participles used as nouns rather than verbs.
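
One way to check the participle observation would be to tally the lexical part of speech, contextual part of speech, and verb tense of the words in each disagreement. The sketch below is only illustrative: `tally_disagreements`, `toy_features`, `toy_disagreements`, and `lookup` are hypothetical names, and the toy values stand in for what the BHSA features `sp`, `pdp`, and `vt` would return. It assumes each record has the same 4-tuple shape as the entries of `verb_disagree`.

```python
from collections import Counter

# Hypothetical feature values standing in for the BHSA features
# sp (lexical pos), pdp (contextual pos), and vt (verb tense).
toy_features = {
    101: {'sp': 'verb', 'pdp': 'subs', 'vt': 'ptca'},
    102: {'sp': 'verb', 'pdp': 'subs', 'vt': 'ptca'},
    103: {'sp': 'subs', 'pdp': 'subs', 'vt': 'NA'},
}

def lookup(node, name):
    """Toy replacement for a feature lookup on a word node."""
    return toy_features[node][name]

def tally_disagreements(disagreements, feature):
    """Count (sp, pdp, vt) combinations over all word nodes in the records."""
    counts = Counter()
    for _text, _section, nodes, _gbi_ids in disagreements:
        for node in nodes:
            key = (feature(node, 'sp'), feature(node, 'pdp'), feature(node, 'vt'))
            counts[key] += 1
    return counts

# Toy records shaped like verb_disagree entries: (text, section, nodes, gbi_ids)
toy_disagreements = [
    ('w1', ('Genesis', 1, 1), (101,), (9001,)),
    ('w2', ('Genesis', 1, 2), (102,), (9002,)),
    ('w3', ('Genesis', 1, 3), (103,), (9003,)),
]

counts = tally_disagreements(toy_disagreements, lookup)
print(counts.most_common(1))
```

With the real data loaded, the same function could presumably be called as `tally_disagreements(verb_disagree, lambda n, name: Fs(name).v(n))`, using the `Fs` object imported in the first cell.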

In [None]:
verb_disagree[500:510]

### Cluster English text into contextualized chunks for parsing

Translations are currently aligned at the word level. To run the spaCy 
dependency parser on a full sentence, we need to place each translated verb
within its sentence. Here we build the contexts that the parser will parse.

The translation data stores punctuators (punctuation marks) as separate word entries, e.g.

```
{'id': 1001001012,
  'altId': '.-1',
  'text': '.',
  'transType': '',
  'isPunc': True,
  'isPrimary': False}
```

We will first select verses, then split on punctuators.

Procedure:

1. For each verb, get the verse in English translation, words in verse, and split verse by punctuators.
2. Map strings to string objects so they can be tracked while being manipulated like normal strings.
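
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's final implementation: `TrackedStr` and `split_on_punctuation` are hypothetical names, and the toy verse only mimics the `id`, `text`, and `isPunc` keys of the GBI word records shown earlier.

```python
from itertools import groupby

class TrackedStr(str):
    """A str that remembers the id of the translation word it came from."""
    def __new__(cls, text, word_id):
        obj = super().__new__(cls, text)
        obj.word_id = word_id
        return obj

def split_on_punctuation(words):
    """Split a verse's word records into chunks, breaking at each punctuator."""
    chunks = []
    for is_punc, group in groupby(words, key=lambda w: w['isPunc']):
        if not is_punc:
            chunks.append([TrackedStr(w['text'], w['id']) for w in group])
    return chunks

# Toy verse with made-up ids, shaped like the GBI translation words
verse = [
    {'id': 1, 'text': 'In', 'isPunc': False},
    {'id': 2, 'text': 'the', 'isPunc': False},
    {'id': 3, 'text': 'beginning', 'isPunc': False},
    {'id': 4, 'text': ',', 'isPunc': True},
    {'id': 5, 'text': 'God', 'isPunc': False},
    {'id': 6, 'text': 'created', 'isPunc': False},
    {'id': 7, 'text': '.', 'isPunc': True},
]

chunks = split_on_punctuation(verse)
print([' '.join(chunk) for chunk in chunks])
print(chunks[1][1], chunks[1][1].word_id)
```

One caveat with this design: built-in string methods such as `.lower()` return a plain `str`, so the `word_id` attribute survives joining and comparison but is lost on any method that creates a new string.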

## Parsing English with spaCy

For the basics of spaCy, see:
https://spacy.io/usage/linguistic-features

In [20]:
import spacy 
nlp = spacy.load('en_core_web_sm')
test_str = 'In the beginning, God created the heavens and the earth. Let there be light.'

doc = nlp(test_str)

In [23]:
for token in doc:
    print(token.text, token.tag_, token.pos_, token.dep_)

In IN ADP prep
the DT DET det
beginning NN NOUN pobj
, , PUNCT punct
God NNP PROPN nsubj
created VBD VERB ROOT
the DT DET det
heavens NNPS PROPN dobj
and CC CCONJ cc
the DT DET det
earth NN NOUN conj
. . PUNCT punct
Let VB VERB ROOT
there EX PRON expl
be VB AUX ccomp
light JJ ADJ attr
. . PUNCT punct


In [24]:
spacy.explain('VBD')

'verb, past tense'

In [25]:
spacy.explain('EX')

'existential there'

In [26]:
spacy.explain('VB')

'verb, base form'

In [27]:
spacy.explain('expl')

'expletive'

In [28]:
spacy.explain('ccomp')

'clausal complement'

Looking at sentence segmentations:

In [34]:
for sent in doc.sents:
    print(sent)

In the beginning, God created the heavens and the earth.
Let there be light.


In [35]:
# noun segments
for nc in doc.noun_chunks:
    print(nc)

the beginning
God
the heavens
the earth
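
Since the goal is to analyze English verbal forms, the dependency labels printed earlier could be used to collect a verb together with its auxiliaries (e.g. "was hovering"). The sketch below is an assumption about how that might look, not an established part of this pipeline: `verb_group` and `AUX_DEPS` are hypothetical names. The function relies only on the token attributes `.text`, `.dep_`, `.i`, and `.children`, so it is demonstrated here on stand-in objects rather than on a loaded model.

```python
from types import SimpleNamespace

# Dependency labels treated as auxiliaries (an assumption; adjust as needed)
AUX_DEPS = {'aux', 'auxpass'}

def verb_group(token):
    """Return the verb plus its auxiliary children as text, in sentence order."""
    parts = [child for child in token.children if child.dep_ in AUX_DEPS]
    parts.append(token)
    parts.sort(key=lambda t: t.i)
    return ' '.join(t.text for t in parts)

# Stand-in tokens that mimic the spaCy attributes used above
was = SimpleNamespace(text='was', dep_='aux', i=4, children=[])
hovering = SimpleNamespace(text='hovering', dep_='ROOT', i=5, children=[was])

print(verb_group(hovering))
```

With a parsed `doc`, the same function could be applied to real tokens, for example `[verb_group(t) for t in doc if t.pos_ == 'VERB']`. Whether the parser labels a given auxiliary as `aux` depends on the model, so the results should be spot-checked.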
