<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-small.png"/></a>
<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="right" src="images/laf-fabric-small.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="right"src="images/DANS-small.png"/></a>

# Complement corrections


# 0. Introduction

Joint work of Dirk Roorda and Janet Dyk.

In order to do
[flowchart analysis](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/flowchart.html)
on verbs, we need to correct some coding errors.

Because the flowchart assigns meanings to verbs depending on the number and nature of complements found in their context, it is important that the phrases in those clauses are labeled correctly, i.e. that the
[function](https://shebanq.ancient-data.org/shebanq/static/docs/featuredoc/features/comments/function.html)
feature for those phrases have the correct label.

# References

(Janet Dyk, Reinoud Oosting and Oliver Glanz, 2014) 
Analysing Valence Patterns in Biblical Hebrew: Theoretical Questions and Analytic Frameworks.
*J. of Northwest Semitic Languages, vol. 40 (2014), no. 1, pp. 43-62*.
[pdf abstract](http://academic.sun.ac.za/jnsl/Volumes/JNSL%2040%201%20abstracts%20and%20bookreview.pdf)
[pdf fulltext (author's copy with deviant page numbering)](https://shebanq.ancient-data.org/static/docs/methods/2014_Dyk_jnsl.pdf)

(Janet Dyk 2014)
Deportation or Forgiveness in Hosea 1.6? Verb Valence Patterns and Translation Proposals.
*The Bible Translator 2014, Vol. 65(3) 235–279*.
[pdf](http://tbt.sagepub.com/content/65/3/235.full.pdf?ijkey=VK2CEHvVrvSGA5B&keytype=finite)

(Janet Dyk 014)
Traces of Valence Shift in Classical Hebrew.
In: *Discourse, Dialogue, and Debate in the Bible: Essays in Honour of Frank Polak*.
Ed. Athalya Brenner-Idan.
*Sheffield Pheonix Press, 48–65*.
[book behind pay-wall](http://www.sheffieldphoenix.com/showbook.asp?bkid=273)

# 1. Task
In this notebook we do the following tasks:

* generate correction sheets for selected verbs,
* process the set of filled in correction sheets
* generate sheets with computed, new features (based on corrected values, valence related) to be edited manually
* transform the set of filled in enrichment sheets into an annotation package

Between the first and second task, the sheets will have been filled in by Janet with corrections.
Between the third and the fourth task, the sheets will be inspected and improved by Janet.

The resulting annotation package offers the corrections as the value of a new feature, also called `function`, but now in the annotation space `JanetDyk` instead of `etcbc4`.
The results of the enrichment will be added as new features in that same annotation space.

## 1.1 Limitations
We restrict ourselves to verb occurrences where the verb is the nucleus of a phrase with function *predicate*. 
There are also verb occurrences in other kinds of phrases, and these also can have complements. These cases are coded very differently in the database. See for example [Joshua 3:8](https://shebanq.ancient-data.org/hebrew/text?book=Josua&chapter=3&verse=8&version=4b&mr=m&qw=q&tp=txt_tb1&tr=hb&wget=v&qget=v&nget=v). (*and you command the priest carrying the ark* ...).

# 2. Implementation

Start the engines, and note the import of the `ExtraData` functionality from the `etcbc.extra` module.
This module can turn data with anchors into additional LAF annotations to the big ETCBC LAF resource.

In [3]:
import sys,os, collections
from copy import deepcopy

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.extra import ExtraData

fabric = LafFabric()

  0.00s This is LAF-Fabric 4.8.3
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [4]:
source = 'etcbc'
version = '4b'

We instruct the API to load data.
Note that we ask for the XML identifiers, because `ExtraData` needs them to stitch the corrections into the LAF XML.

In [5]:
API = fabric.load(source+version, 'lexicon', 'flow_corr', {
    "xmlids": {"node": True, "edge": False},
    "features": ('''
        oid otype
        sp vs lex uvf prs nametype ls
        function rela typ
        chapter verse
    ''','''
        mother
    '''),
    "prepare": prepare,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: etcbc4b: UP TO DATE
  0.01s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
  0.01s DETAIL: COMPILING a: lexicon: UP TO DATE
  0.01s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54
  0.02s DETAIL: load main: G.node_anchor_min
  0.07s DETAIL: load main: G.node_anchor_max
  0.13s DETAIL: load main: G.node_sort
  0.19s DETAIL: load main: G.node_sort_inv
  0.66s DETAIL: load main: G.edges_from
  0.71s DETAIL: load main: G.edges_to
  0.77s DETAIL: load main: X. [node]  -> 
  1.87s DETAIL: load main: X. [node]  <- 
  2.53s DETAIL: load main: F.etcbc4_db_oid [node] 
  3.18s DETAIL: load main: F.etcbc4_db_otype [node] 
  3.80s DETAIL: load main: F.etcbc4_ft_function [node] 
  3.91s DETAIL: load main: F.etcbc4_ft_lex [node] 
  4.08s DETAIL: load main: F.etcbc4_ft_ls [node] 
  4.25s DETAIL: load main: F.etcbc4_ft_prs [node] 
  4.42s DETAIL: load main: F.etcbc4_ft_rela [node] 
  4.74s DETAIL: load main: F.etcb

# 2.1 Locations

In [6]:
ln_base = 'https://shebanq.ancient-data.org/hebrew/text'
ln_tpl = '?book={}&chapter={}&verse={}'
ln_tweak = '&version=4b&mr=m&qw=n&tp=txt_tb1&tr=hb&wget=x&qget=v&nget=x'

home_dir = os.path.expanduser('~').replace('\\', '/')
base_dir = '{}/Dropbox/SYNVAR'.format(home_dir)
result_dir = '{}/results'.format(base_dir)
all_results = '{}/all.csv'.format(result_dir)
kinds = ('corr_blank', 'corr_filled', 'enrich_blank', 'enrich_filled')
kdir = {}
for k in kinds:
    kd = '{}/{}'.format(base_dir, k)
    kdir[k] = kd
    if not os.path.exists(kd):
        os.makedirs(kd)
if not os.path.exists(result_dir):
    os.makedirs(result_dir)


def vfile(verb, kind):
    if kind not in kinds:
        msg('Unknown kind `{}`'.format(kind))
        return None
    return '{}/{}_{}{}.csv'.format(kdir[kind], verb.replace('>','a').replace('<', 'o'), source, version)

# 2.2 Domain
Here is the set of verbs that interest us.

In [7]:
verbs_initial = set('''
    CJT
    BR>
    QR>
'''.strip().split())

motion_verbs = set('''
    <BR
    <LH
    BW>
    CWB
    HLK
    JRD
    JY>
    NPL
    NWS
    SWR
'''.strip().split())

double_object_verbs = set('''
    NTN
    <FH
    FJM
'''.strip().split())

complex_qal_verbs = set('''
    NF>
    PQD
'''.strip().split())

verbs = verbs_initial | motion_verbs | double_object_verbs | complex_qal_verbs

# 2.3 Phrase function

We need to correct some values of the phrase function.
When we receive the corrections, we check whether they have legal values.
Here we look up the possible values.

In [8]:
predicate_functions = {
    'Pred', 'PreS', 'PreO', 'PreC', 'PtcO', 'PrcS',
}

In [9]:
legal_values = dict(
    function={F.function.v(p) for p in F.otype.s('phrase')},
)

We generate a list of occurrences of those verbs, organized by the lexeme of the verb.
We need some extra values, to indicate other coding errors.

In [10]:
error_values = dict(
    function=dict(
        BoundErr='this constituent is part of another constituent and does not merit its own function/type/rela value',
    ),
)

We add the error_values to the legal values.

In [11]:
for feature in set(legal_values.keys()) | set(error_values.keys()):
    ev = error_values.get(feature, {})
    if ev:
        lv = legal_values.setdefault(feature, set())
        lv |= set(ev.keys())
inf('{}'.format(legal_values))

  1.47s {'function': {'PreO', 'Loca', 'Ques', 'PrcS', 'Conj', 'PreC', 'NCoS', 'ModS', 'Cmpl', 'Supp', 'Voct', 'Modi', 'Subj', 'Exst', 'PtcO', 'Frnt', 'Time', 'Nega', 'PreS', 'Pred', 'Intj', 'IntS', 'PrAd', 'BoundErr', 'Adju', 'NCop', 'EPPr', 'Objc', 'Rela', 'ExsS'}}


In [12]:
inf('Finding occurrences ...')
# we restrict our selves to selected verbs and their contexts
occs = collections.defaultdict(list)   # dictionary of verb occurrence nodes per verb lexeme
npoccs = collections.defaultdict(list) # same, but those not occurring in a "predicate"
clause_verb = collections.defaultdict(list)    # dictionary of verb occurrence nodes per clause node
clause_verb_index = collections.defaultdict(set) # mapping from clauses to its main verb(s)
verb_clause_index = collections.defaultdict(list) # mapping from verbs to the clauses of which it is main verb

nw = 0
nws = 0
for w in F.otype.s('word'):
    if F.sp.v(w) != 'verb': continue
    lex = F.lex.v(w).rstrip('[')
    if lex not in verbs: continue
    nw += 1
    pf = F.function.v(L.u('phrase', w))
    if pf not in predicate_functions:
        npoccs[lex].append(w)
    occs[lex].append(w)
    cn = L.u('clause', w)
    clause_verb[cn].append(w)
    clause_verb_index[cn].add(lex)
    verb_clause_index[lex].append(cn)

inf('Done')
inf('Selected:    {:>6} verb occurrences in {} clauses'.format(nw, len(clause_verb)), withtime=False)

for verb in sorted(verbs):
    inf('{} {:>5} occurrences of which {:>4} outside a predicate phrase'.format(
        verb, 
        len(occs[verb]),
        len(npoccs[verb]),
        withtime=False,
    ))

  1.51s Finding occurrences ...
  3.14s Done
Selected:     16036 verb occurrences in 15880 clauses
  3.14s <BR   548 occurrences of which   32 outside a predicate phrase
  3.14s <FH  2629 occurrences of which   59 outside a predicate phrase
  3.14s <LH   890 occurrences of which   10 outside a predicate phrase
  3.14s BR>    48 occurrences of which    3 outside a predicate phrase
  3.14s BW>  2570 occurrences of which   27 outside a predicate phrase
  3.14s CJT    85 occurrences of which    1 outside a predicate phrase
  3.14s CWB  1037 occurrences of which   22 outside a predicate phrase
  3.15s FJM   609 occurrences of which    3 outside a predicate phrase
  3.15s HLK  1554 occurrences of which   30 outside a predicate phrase
  3.15s JRD   377 occurrences of which   16 outside a predicate phrase
  3.15s JY>  1069 occurrences of which   32 outside a predicate phrase
  3.15s NF>   656 occurrences of which   52 outside a predicate phrase
  3.15s NPL   445 occurrences of which   11 outsi

# 3 Blank sheet generation
Generate correction sheets.
They are CSV files. Every row corresponds to a verb occurrence.
The fields per row are the node numbers of the clause in which the verb occurs, the node number of the verb occurrence, the text of the verb occurrence (in ETCBC transliteration, consonantal) a passage label (book, chapter, verse), and then 4 columns for each phrase in the clause:

* phrase node number
* phrase text (ETCBC translit consonantal)
* original value of the `function` feature
* corrected value of the `function` feature (generated as empty)

In [13]:
phrases_seen = collections.Counter()

def gen_sheet(verb):
    rows = []
    fieldsep = ';'
    field_names = '''
        clause#
        word#
        passage
        link
        verb
        stem
    '''.strip().split()
    max_phrases = 0
    clauses_seen = set()
    for wn in occs[verb]:
        cln = L.u('clause', wn)
        if cln in clauses_seen: continue
        clauses_seen.add(cln)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        ch = F.chapter.v(vn)
        vs = F.verse.v(vn)
        passage_label = T.passage(vn)
        ln = ln_base+(ln_tpl.format(T.book_name(bn, lang='la'), ch, vs))+ln_tweak
        lnx = '''"=HYPERLINK(""{}""; ""link"")"'''.format(ln)
        vt = T.words([wn], fmt='ec').replace('\n', '')
        vstem = F.vs.v(wn)
        np = '* ' if wn in npoccs[verb] else ''
        row = [cln, wn, passage_label, lnx, np+vt, vstem]
        phrases = L.d('phrase', cln)
        n_phrases = len(phrases)
        if n_phrases > max_phrases: max_phrases = n_phrases
        for pn in phrases:
            phrases_seen[pn] += 1
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            pf = F.function.v(pn)
            pnp = np if pf in predicate_functions else ''
            row.extend((pn, pnp+pt, pf, ''))
        rows.append(row)
    for i in range(max_phrases):
        field_names.extend('''
            phr{i}#
            phr{i}_txt
            phr{i}_function
            phr{i}_corr
        '''.format(i=i+1).strip().split())
    filename = vfile(verb, 'corr_blank')
    row_file = open(filename, 'w')
    row_file.write('{}\n'.format(fieldsep.join(field_names)))
    for row in rows:
        row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    row_file.close()
    inf('Generated correction sheet for verb {}'.format(filename))
    
for verb in verbs: gen_sheet(verb)
    
stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

  3.70s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/BWa_etcbc4b.csv
  3.83s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/QRa_etcbc4b.csv
  3.91s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/NPL_etcbc4b.csv
  3.92s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/BRa_etcbc4b.csv
  3.99s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/JRD_etcbc4b.csv
  4.17s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/JYa_etcbc4b.csv
  4.31s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/oLH_etcbc4b.csv
  4.43s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/NFa_etcbc4b.csv
  4.80s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/oFH_etcbc4b.csv
  4.82s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/CJT_etcbc4b.csv
  5.03s Generated co

# 4 Processing corrections
We read the filled-in correction sheets and extract the correction data out of it.
We store the corrections in a dictionary keyed by the phrase node.
We check whether we get multiple corrections for the same phrase.

In [14]:
phrases_seen = collections.Counter()
pf_corr = {}

def read_corr():
    function_values = legal_values['function']

    for verb in sorted(verbs):
        repeated = collections.defaultdict(list)
        non_phrase = set()
        illegal_fvalue = set()

        filename = vfile(verb, 'corr_filled')
        if not os.path.exists(filename):
            msg('NO file {}'.format(filename))
            continue
        else:
            inf('Processing {}'.format(filename))
        with open(filename) as f:
            header = f.__next__()
            for line in f:
                fields = line.rstrip().split(';')
                for i in range(1, len(fields)//4):
                    (pn, pc) = (fields[2+4*i], fields[2+4*i+3])
                    if pn != '':
                        pc = pc.strip()
                        pn = int(pn)
                        phrases_seen[pn] += 1
                        if pc != '':
                            good = True
                            for i in [1]:
                                good = False
                                if pn in pf_corr:
                                    repeated[pn] += pc
                                    continue
                                if pc not in function_values:
                                    illegal_fvalue.add(pc)
                                    continue
                                if F.otype.v(pn) != 'phrase': 
                                    non_phrase.add(pn)
                                    continue
                                good = True
                            if good:
                                pf_corr[pn] = pc

        inf('{}: Found {:>5} corrections in {}'.format(verb, len(pf_corr), filename))
        if len(repeated):
            msg('ERROR: Some phrases have been corrected multiple times!')
            for x in sorted(repeated):
                msg('{:>6}: {}'.format(x, ', '.join(repeated[x])))
        else:
            inf('OK: Corrected phrases did not receive multiple corrections')
        if len(non_phrase):
            msg('ERROR: Corrections have been applied to non-phrase nodes: {}'.format(','.join(non_phrase)))
        else:
            inf('OK: all corrected nodes where phrase nodes')
        if len(illegal_fvalue):
            msg('ERROR: Some corrections supply illegal values for phrase function!')
            msg('`{}`'.format('`, `'.join(illegal_fvalue)))
        else:
            inf('OK: all corrected values are legal')
    inf('Found {} corrections in the phrase function'.format(len(pf_corr)))
        
read_corr()

stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

  6.14s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oBR_etcbc4b.csv


  6.14s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/oFH_etcbc4b.csv
  6.18s <FH: Found   735 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/oFH_etcbc4b.csv
  6.18s OK: Corrected phrases did not receive multiple corrections
  6.19s OK: all corrected nodes where phrase nodes
  6.19s OK: all corrected values are legal


  6.19s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oLH_etcbc4b.csv


  6.19s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/BRa_etcbc4b.csv
  6.19s BR>: Found   739 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/BRa_etcbc4b.csv
  6.19s OK: Corrected phrases did not receive multiple corrections
  6.19s OK: all corrected nodes where phrase nodes
  6.19s OK: all corrected values are legal
  6.19s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/BWa_etcbc4b.csv
  6.24s BW>: Found   794 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/BWa_etcbc4b.csv
  6.24s OK: Corrected phrases did not receive multiple corrections
  6.24s OK: all corrected nodes where phrase nodes
  6.24s OK: all corrected values are legal
  6.25s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/CJT_etcbc4b.csv
  6.25s CJT: Found   797 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/CJT_etcbc4b.csv
  6.25s OK: Corrected phrases did not receive multiple corrections
  6.25s OK: all corrected nodes where phrase nodes
  6.25s OK: all corrected values are legal
  6.25s Pr

  6.34s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/JRD_etcbc4b.csv
  6.35s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/JYa_etcbc4b.csv


  6.35s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/NFa_etcbc4b.csv
  6.36s NF>: Found  1224 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/NFa_etcbc4b.csv
  6.36s OK: Corrected phrases did not receive multiple corrections
  6.36s OK: all corrected nodes where phrase nodes
  6.36s OK: all corrected values are legal


  6.37s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/NPL_etcbc4b.csv


  6.37s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/NTN_etcbc4b.csv
  6.41s NTN: Found  1368 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/NTN_etcbc4b.csv
  6.41s OK: Corrected phrases did not receive multiple corrections
  6.41s OK: all corrected nodes where phrase nodes
  6.42s OK: all corrected values are legal
  6.42s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/NWS_etcbc4b.csv
  6.43s NWS: Found  1379 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/NWS_etcbc4b.csv
  6.43s OK: Corrected phrases did not receive multiple corrections
  6.43s OK: all corrected nodes where phrase nodes
  6.43s OK: all corrected values are legal
  6.43s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/PQD_etcbc4b.csv
  6.44s PQD: Found  1405 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/PQD_etcbc4b.csv
  6.44s OK: Corrected phrases did not receive multiple corrections
  6.44s OK: all corrected nodes where phrase nodes
  6.44s OK: all corrected values are legal
  6.44s Pr

# 5. Enrichment

We create blank sheets for new feature assignments, based on the corrected data.

In [15]:
enrich_field_spec = '''
valence
    adjunct
    complement
    core

predication
    NA
    regular
    copula

grammatical
    NA
    subject
    principal_direct_object
    direct_object
    NP_direct_object
    indirect_object
    L_object
    K_object
    infinitive_object
    *

original
    NA
    subject
    principal_direct_object
    direct_object
    NP_direct_object
    indirect_object
    L_object
    K_object
    infinitive_object
    *

lexical
    location
    time

semantic
    benefactive
    time
    location
    instrument
    manner
'''
enrich_fields = collections.OrderedDict()
cur_e = None
for line in enrich_field_spec.strip().split('\n'):
    if line.startswith(' '):
        enrich_fields.setdefault(cur_e, set()).add(line.strip())
    else:
        cur_e = line.strip()
nef = len(enrich_fields)
if None in enrich_fields:
    msg('Invalid enrich field specification')
else:
    inf('{} Enrich field specifications OK'.format(nef))
for ef in enrich_fields:
    inf('{} = {{{}}}'.format(ef, ', '.join(sorted(enrich_fields[ef]))), withtime=False)

  6.54s 6 Enrich field specifications OK
valence = {adjunct, complement, core}
predication = {NA, copula, regular}
grammatical = {*, K_object, L_object, NA, NP_direct_object, direct_object, indirect_object, infinitive_object, principal_direct_object, subject}
original = {*, K_object, L_object, NA, NP_direct_object, direct_object, indirect_object, infinitive_object, principal_direct_object, subject}
lexical = {location, time}
semantic = {benefactive, instrument, location, manner, time}


In [16]:
enrich_baseline_rules = dict(
    phrase='''Adju	Adjunct	adjunct	NA	NA			
Cmpl	Complement	complement	NA	*			
Conj	Conjunction	NA	NA	NA		NA	NA
EPPr	Enclitic personal pronoun	NA	copula	NA			
ExsS	Existence with subject suffix	core	copula	subject			
Exst	Existence	core	copula	NA			
Frnt	Fronted element	NA	NA	NA		NA	NA
Intj	Interjection	NA	NA	NA		NA	NA
IntS	Interjection with subject suffix	core	NA	subject			
Loca	Locative	adjunct	NA	NA		location	location
Modi	Modifier	NA	NA	NA		NA	NA
ModS	Modifier with subject suffix	core	NA	subject			
NCop	Negative copula	core	copula	NA			
NCoS	Negative copula with subject suffix	core	copula	subject			
Nega	Negation	NA	NA	NA		NA	NA
Objc	Object	complement	NA	direct_object			
PrAd	Predicative adjunct	adjunct	NA	NA			
PrcS	Predicate complement with subject suffix	core	regular	subject			
PreC	Predicate complement	core	regular	NA			
Pred	Predicate	core	regular	NA			
PreO	Predicate with object suffix	core	regular	direct_object			
PreS	Predicate with subject suffix	core	regular	subject			
PtcO	Participle with object suffix	core	regular	direct_object			
Ques	Question	NA	NA	NA		NA	NA
Rela	Relative	NA	NA	NA		NA	NA
Subj	Subject	core	NA	subject			
Supp	Supplementary constituent	adjunct	NA	NA			benefactive
Time	Time reference	adjunct	NA	NA		time	time
Unkn	Unknown	NA	NA	NA		NA	NA
Voct	Vocative	NA	NA	NA		NA	NA''',
    clause='''Objc	Object	complement	NA	direct_object			
InfC	Infinitive Construct clause	NA	NA				''',
)

In [17]:
transform = collections.OrderedDict((('phrase', {}), ('clause', {})))
errors = 0
good = 0

for kind in ('phrase', 'clause'):
    for line in enrich_baseline_rules[kind].split('\n'):
        x = line.split('\t')
        nefields = len(x) - 2
        if len(x) - 2 != nef:
            msg('Wrong number of fields ({} must be {}) in {}:\n{}'.format(nefields, nef, kind, line))
            errors += 1
        transform[kind][x[0]] = dict(zip(enrich_fields, x[2:]))
    for e in error_values['function']:
        transform[kind][e] = dict(zip(enrich_fields, ['']*nef))

    for f in transform[kind]:
        for e in enrich_fields:
            val = transform[kind][f][e]
            if val != '' and val != 'NA' and val not in enrich_fields[e]:
                msg('Defaults for `{}` ({}): wrong `{}` value: "{}"'.format(f, kind, e, val))
                errors += 1
            else: good += 1
if errors:
    msg('There were {} errors ({} good)'.format(errors, good))
else:
    inf('Enrich baseline rules are OK ({} good)'.format(good))

  6.62s Enrich baseline rules are OK (204 good)


Let us prettyprint the baseline rules of enrichment for easier reference.

In [18]:
ltpl = '{:<8}: '+('{:<15}' * nef)
inf(ltpl.format('func', *enrich_fields), withtime=False)
for kind in transform:
    inf('[{}]'.format(kind), withtime=False)
    for f in sorted(transform[kind]):
        sfs = transform[kind][f]
        inf(ltpl.format(f, *[sfs[sf] for sf in enrich_fields]), withtime=False)

func    : valence        predication    grammatical    original       lexical        semantic       
[phrase]
Adju    : adjunct        NA             NA                                                          
BoundErr:                                                                                           
Cmpl    : complement     NA             *                                                           
Conj    : NA             NA             NA                            NA             NA             
EPPr    : NA             copula         NA                                                          
ExsS    : core           copula         subject                                                     
Exst    : core           copula         NA                                                          
Frnt    : NA             NA             NA                            NA             NA             
IntS    : core           NA             subject                                   

## 5.1 Enrichment logic

For certain verbs and certain conditions, we can automatically fill in some of the new features.
For example, if the verb is `CJT`, and if an adjunct phrase is personal, starting with `L`, we know that the semantic role is *benefactive*.

We will also analyse the direct and indirect objects more precisely and implement heuristics to make a distinction between complements (locative) and indirect objects.

### Finding the direct objects

In the target clauses we will find the direct object(s).
If there is more than one, we will compute which is the principal one.
The others are secundary ones.
If there is only one direct object, we do not mark it as principal.

An object can be a phrase or a clause.

### Clauses as objects
We will treat clauses marked as `Objc` by feature `rela` as direct objects.
Additionally, we identify clauses marked as `InfC` by feature `typ` as direct objects if they are preceded by the preposition *L* and if there is a direct object phrase elsewhere in the clause.

We will not mark all these object clauses as principal direct objects, by rules stated later on.

### Implied objects

There are many cases where there is a direct object without it being marked as such in the data.
Those are cases where there are no objective, unambiguous signals for a direct object.
We call them *implied objects*. Examples: 

* the relativum in relative clauses
* complements starting with MN (from) or L (to)

In the case of implied objects we have to guess.
Initially we assume that there are no implied objects.

Later, when we inspect individual cases, we can mark principal objects and implied objects manually
for those cases where these rules do not suffice.

### Finding the principal direct object

When there are multiple direct objects, we use the rules formulated by (Janet Dyk, Reinoud Oosting and Oliver Glanz, 2014) to determine which one is the principal one. The rules are stated below where we make some remarks about how we apply them to our data.

#### Interpretation

When looking for principal direct objects, we restrict ourselves to direct objects at the phrase level, either being complete phrases, or pronominal suffixes within phrases. The following rules express a preference for the principal direct object. In a given context, we select the direct object that is preferred by applying those rules as the principal direct object. We only apply these rules if there are at least two direct objects.
If there is only one direct object, it is not marked as principal.

#### Rule 1: pronominal suffixes > preferred above marked objects > unmarked objects

In a given clause, we collect all phrases with function ``PreO`` or ``PtcO``. 
If this collection is non-empty, we pick the one that is textually first (by rule 3 below) and stop applying rules.
Otherwise, we proceed as follows.

We collect all the phrases with function ``Objc``.
If this collection is empty, there will not be a principal object.
Otherwise, we split it up in marked and unmarked object phrases.

An object phrase is *marked* if and only if it contains, somewhere, the object marker ``>T``.
If there are marked object phrases, we pick the one that is textually first (by rule 3 below) and stop applying rules.
Otherwise we proceed with the next rule.

#### Rule 2: determined phrases > undetermined phrases

We only arrive here if there are multiple ``Objc`` phrases, neither of which is marked.
In this case, we take the textually first one (by rule 3) which has the value ``det`` for its feature ``det``, if there is one, and stop applying rules.
Otherwise we proceed with the next rule.

#### Rule 3: earlier phrases > later phrases (by textual order)

This rule is implicitly applied if one of the rules before yielded more than one candidate for the principal object. Furthermore, we arrive here if the previous rules have not selected any principal direct object, while we do have more than one ``Objc`` phrase.

In this case, we pick the textually first ``Objc`` phrase.

### Non principal objects

In case there is a principal object, we divide the other objects into two kinds:
* clause objects
* phrase objects

We will give the phrase objects the grammatical label `NP_direct_object`.

### Complements as LK Objects

In some cases, a complement functions as objects, such as in [Genesis 21:13](https://shebanq.ancient-data.org/hebrew/text?nget=v&chapter=21&book=Genesis&qw=n&tp=txt_tb1&version=4b&mr=m) *I make him (into) a people*.

Candidates are those complements that: 

* start with either preposition ``L`` or ``K`` and
* the ``L`` or ``K`` in question does not carry a pronominal suffix
* should also not be followed by a body part

We generated grammatical labels ``L_object`` and ``K_object`` in these cases.
The flowchart will make a distinction between ``L_object`` and ``K_object``.

An L/K object is never a *principal* direct object.

In [19]:
objectfuncs = set('''
Objc PreO PtcO
'''.strip().split())

cmpl_as_obj_preps = set('''
K L
'''.strip().split())

no_prs = set('''
absent n/a
'''.strip().split())

In [20]:
body_parts = set('''
>NP/ >P/ >PSJM/ >YB</ >ZN/
<JN/ <NQ/ <RP/ <YM/ <YM==/
BHN/ BHWN/ BVN/
CD=/ CD===/ CKM/ CN/
DD/
GRGRT/ GRM/ GRWN/ GW/ GW=/ GWJH/ GWPH/ GXWN/
FPH/
JD/ JRK/ JRKH/
KRF/ KSL=/ KTP/
L</ LCN/ LCWN/ LXJ/
M<H/ MPRQT/ MTL<WT/ MTNJM/ MYX/
NBLH=/
P<M/ PGR/ PH/ PM/ PNH/ PT=/
QRSL/
R>C/ RGL/
XDH/ XLY/ XMC=/ XRY/
YW>R/
ZRW</
'''.strip().split())

In [21]:
inf('Finding direct objects and determining the principal one')
clause_objects = collections.defaultdict(set)
objects = collections.defaultdict(set)
objects_count = collections.defaultdict(collections.Counter)
object_kinds = (
    'principal',
    'direct',
    'NP',
    'L',
    'K',
    'clause',
    'infinitive',
)

def is_marked(phr):
    # simple criterion for determining whether a direct object is marked:
    # has it the object marker somewhere?
    words = L.d('word', p)
    has_et = False
    for w in words:
        if F.lex.v(w) == '>T':
            has_et = True
            break
    return has_et

for c in clause_verb:
    these_objects = collections.defaultdict(set)
    direct_objects_cat = collections.defaultdict(set)

    for p in L.d('phrase', c):
        pf = pf_corr.get(p, F.function.v(p))  # NB we take the corrected value for phrase function if there is one
        if pf in objectfuncs:
            direct_objects_cat['p_'+pf].add(p)
            these_objects['direct'].add(p)
        elif pf == 'Cmpl':
            pwords = L.d('word', p)
            w1 = pwords[0]
            w1l = F.lex.v(w1)
            w2l = F.lex.v(pwords[1]) if len(pwords) > 1 else None
            if w1l in cmpl_as_obj_preps and F.prs.v(w1) in no_prs and not (w1l == 'L' and w2l in body_parts):
                if w1l == 'K': these_objects['K'].add(p)
                elif w1l == 'L': these_objects['L'].add(p)
        
    # find clause objects
    for ac in L.d('clause', L.u('sentence', c)):
        mothers = list(C.mother.v(ac))
        if not (mothers and mothers[0] == c): continue
        cr = F.rela.v(ac)
        ct = F.typ.v(ac)
        if cr in {'Objc'} or ct in {'InfC'}:
            clause_objects[c].add(ac)
            if cr in {'Objc'}:
                label = cr
                direct_objects_cat['c_'+label].add(ac)
                these_objects['direct'].add(ac)
                these_objects['clause'].add(ac)
            elif ct in {'InfC'}:
                if F.lex.v(L.d('word', ac)[0]) == 'L':
                    these_objects['infinitive'].add(ac)
        else:
            continue

    # order the objects in the natural ordering
    direct_objects_order = sorted(these_objects.get('direct', set()), key=NK)
    nobjects = len(direct_objects_order)

    # compute the principal object
    principal_object = None

    for x in [1]:
        # just one object 
        if nobjects == 1:
            # we have chosen not to mark a principal object if there is only one object
            # the alternative is to mark it if it is a phrase. Uncomment the next 2 lines if you want this
            # theobject = list(dobjects_set)[0]
            # if F.otype.v(theobject) == 'phrase': principal_object = theobject
            break
        # rule 1: suffixes and promoted objects
        principal_candidates =\
            direct_objects_cat.get('p_PreO', set()) |\
            direct_objects_cat.get('p_PtcO', set())
        if len(principal_candidates) != 0:
            principal_object = sorted(principal_candidates, key=NK)[0]
            break
        principal_candidates = direct_objects_cat.get('p_Objc', set())
        if len(principal_candidates) != 0:
            if len(principal_candidates) == 1:
                principal_object = list(principal_candidates)[0]
                break
            objects_marked = set()
            objects_unmarked = set()
            for p in principal_candidates:
                if is_marked(p):
                    objects_marked.add(p)
                else:
                    objects_unmarked.add(p)
            if len(objects_marked) != 0:
                principal_object = sorted(objects_marked, key=NK)[0]
                break
            if len(objects_unmarked) != 0:
                principal_object = sorted(objects_unmarked, key=NK)[0]
                break            
    if principal_object != None:
        these_objects['principal'].add(principal_object)
    if len(these_objects['infinitive']) and not len(these_objects['direct']):
        # we do not mark an infinitive object if there is no proper direct object around
        these_objects['infinitive'] = set()
    if len(these_objects['principal']):
        these_objects['direct'] -= these_objects['principal']
        for x in these_objects['direct'] - these_objects['clause']:
            # the NP objects are the non-principal phrase like direct objects
            these_objects['NP'].add(x)
        these_objects['direct'] -= these_objects['NP']

    for kind in object_kinds:
        n = len(these_objects.get(kind, set()))
        objects_count[kind][n] += 1
        if n:
            objects[kind] |= these_objects[kind]

inf('Done')

for kind in object_kinds:
    total = 0
    for (count, n) in sorted(objects_count[kind].items(), key=lambda y: -y[0]):
        if count: total += n
        inf('{:>5} clauses with {:>2} {:<10} objects'.format(n, count, kind), withtime=False)
    inf('{:>5} clauses with {:>2} {:<10} objects'.format(total, 'a', kind), withtime=False)
inf('{:>5} clauses with {:>2} selected verb'.format(len(clause_verb), 'a'), withtime=False)

  7.16s Finding direct objects and determining the principal one
  7.98s Done
  447 clauses with  1 principal  objects
15433 clauses with  0 principal  objects
  447 clauses with  a principal  objects
    2 clauses with  2 direct     objects
 6014 clauses with  1 direct     objects
 9864 clauses with  0 direct     objects
 6016 clauses with  a direct     objects
  433 clauses with  1 NP         objects
15447 clauses with  0 NP         objects
  433 clauses with  a NP         objects
   21 clauses with  2 L          objects
 1041 clauses with  1 L          objects
14818 clauses with  0 L          objects
 1062 clauses with  a L          objects
   62 clauses with  1 K          objects
15818 clauses with  0 K          objects
   62 clauses with  a K          objects
    2 clauses with  2 clause     objects
  180 clauses with  1 clause     objects
15698 clauses with  0 clause     objects
  182 clauses with  a clause     objects
    5 clauses with  2 infinitive objects
  342 clauses with  

### Finding indirect objects

The ETCBC database has not feature that marks indirect objects.
We will use computation to determine whether a complement is an indirect object or a locative.
This computation is just an approximation.

#### Cues for a locative complement

* ``# loc lexemes`` how many distinct lexemes with a locative meaning occur in the complement (given by a fixed list)
* ``# topo`` how many lexemes with nametype = ``topo`` occur in the complement (nametype is a feature of the lexicon)
* ``# prep_b`` how many occurrences of the preposition ``B`` occur in the complement
* ``# h_loc`` how many H-locales are carried on words in the complement
* ``body_part`` is 2 if the phrase starts with the preposition ``L`` followed by a body part, else 0
* ``locativity`` ($loc$) a crude measure of the locativity of the complement, just the sum of ``# loc lexemes``, ``#topo``, ``# prep_b``, ``# h_loc`` and ``body_part``.

#### Cues for an indirect object
* ``# prep_l`` how many occurrences of the preposition ``L`` or ``>L`` with a pronominal suffix on it occur in the complement
* ``# L prop`` how many occurrences of ``L`` or ``>L`` plus proper name or person reference word occur in the complement
* ``indirect object`` ($ind$) a crude indicator of whether the complement is an indirect object, just the sum of ``# prep_l`` and ``# L prop`` 

#### The decision

We take a decision as follows.
The outcome is $L$ (complement is *locative*) or $I$ (complement is *indirect object*) or $C$ (complement is neither *locative* nor *indirect object*)

(1) $ loc > 0 \wedge ind = 0 \Rightarrow L $

(2) $ loc = 0 \wedge ind > 0 \Rightarrow I $

(3) $ loc > 0 \wedge ind > 0 \wedge\ loc - 1 > ind \Rightarrow L$

(4) $ loc > 0 \wedge ind > 0 \wedge\ loc + 1 < ind \Rightarrow I$

(5) $ loc > 0 \wedge ind > 0 \wedge |ind - loc| <= 1 \Rightarrow C$

In words:

* if there are positive signals for L or I and none for the other, we choose the one for which there are positive signals;
* if there are positive signals for both L and I, we follow the majority count, but only if the difference is at least two;
* in all other cases we leave it at C: not necessarilty locative and not necessarily indirect object.

In [22]:
complfuncs = set('''
Cmpl PreC
'''.strip().split())

cmpl_as_iobj_preps = set('''
L >L
'''.strip().split())

In [23]:
locative_lexemes = set('''
>RY/ >YL/ >XR/
<BR/ <BRH/ <BWR/ <C==/ <JR/ <L=/ <LJ=/ <LJH/ <LJL/ <MD=/ <MDH/ <MH/ <MQ/ <MQ===/ <QB/
BJT/
CM CMJM/ CMC/ C<R/
DRK/
FDH/
HR/
JM/ JRDN/ JRWCLM/ JFR>L/
MDBR/ MW<D/ MWL/ MZBX/ MYRJM/ MQWM/ MR>CWT/ MSB/ MSBH/ MVH==/
QDM/
SBJB/
TJMN/ TXT/ TXWT/
YPWN/
'''.strip().split())

personal_lexemes = set('''
>B/ >CH/ >DM/ >DRGZR/ >DWN/ >JC/ >J=/ >KR/ >LJL/ >LMN=/ >LMNH/ >LMNJ/ >LWH/ >LWP/ >M/ 
>MH/ >MN==/ >MWN=/ >NC/ >NWC/ >PH/ >PRX/ >SJR/ >SJR=/ >SP/ >X/ >XCDRPN/
>XWH/ >XWT/
<BDH=/ <CWQ/ <D=/ <DH=/ <LMH/ <LWMJM/ <M/ <MD/ <MJT/ <QR=/ <R/ <WJL/ <WL/ <WL==/ <WLL/
<WLL=/ <YRH/
B<L/ B<LH/ BKJRH/ BKR/ BN/ BR/ BR===/ BT/ BTWLH/ BWQR/ BXRJM/ BXWN/ BXWR/
CD==/ CDH/ CGL/ CKN/ CLCJM/ CLJC=/ CMRH=/ CPXH/ CW<R/ CWRR/
DJG/ DWD/ DWDH/ DWG/ DWR/
F<JR=/ FB/ FHD/ FR/ FRH/ FRJD/ FVN/
GBJRH/ GBR/ GBR=/ GBRT/ GLB/ GNB/ GR/ GW==/ GWJ/ GZBR/
HDBR/ 
J<RH/ JBM/ JBMH/ JD<NJ/ JDDWT/ JLD/ JLDH/ JLJD/ JRJB/ JSWR/ JTWM/ JWYR/
JYRJM/ 
KCP=/ KHN/ KLH/ KMR/ KN<NJ=/ KNT/ KRM=/ KRWB/ KRWZ/
L>M/ LHQH/ LMD/ LXNH/
M<RMJM/ M>WRH/ MCBR/ MCJX/ MCM<T/ MCMR/ MCPXH/ MCQLT/ MD<=/ MD<T/ MG/
MJNQT/ MKR=/ ML>K/ MLK/ MLKH/ MLKT/ MLX=/ MLYR/ MMZR/ MNZRJM/ MPLYT/
MPY=/ MQHL/ MQY<H/ MR</ MR>/ MSGR=/ MT/ MWRH/ MYBH=/
N<R/ N<R=/ N<RH/ N<RWT/ N<WRJM/ NBJ>/ NBJ>H/ NCJN/ NFJ>/ NGJD/ NJN/ NKD/ 
NKR/ NPC/ NPJLJM/ NQD/ NSJK/ NTJN/ 
PLGC/ PLJL/ PLJV/ PLJV=/ PQJD/ PR<H/ PRC/ PRJY/ PRJY=/ PRTMJM/ PRZWN/ 
PSJL/ PSL/ PVR/ PVRH/ PXH/ PXR/
QBYH/ QCRJM/ QCT=/ QHL/ QHLH/ QHLT/ QJM/ QYJN/
R<H=/ R<H==/ R<JH/ R<=/ R<WT/ R>H/ RB</ RB=/ RB==/ RBRBNJN/ RGMH/ RHB/ RKB=/
RKJL/ RMH/ RQX==/ 
SBL/ SPR=/ SRJS/ SRK/ SRNJM/ 
T<RWBWT/ TLMJD/ TLT=/ TPTJ/ TR<=/ TRCT>/ TRTN/ TWCB/ TWL<H/ TWLDWT/ TWTX/
VBX/ VBX=/ VBXH=/ VPSR/ VPXJM/
WLD/
XBL==/ XBL======/ XBR/ XBR=/ XBR==/ XBRH/ XBRT=/ XJ=/ XLC/ XM=/ XMWT/
XMWY=/ XNJK/ XR=/ XRC/ XRC====/ XRP=/ XRVM/ XTN/ XTP/ XZH=/
Y<JRH/ Y>Y>JM/ YJ/ YJD==/ YJR==/ YR=/ YRH=/ 
ZKWR/ ZMR=/ ZR</
'''.strip().split())

In [24]:
inf('Determinig kind of complements')

complements_c = collections.defaultdict(lambda: collections.defaultdict(lambda: []))
complements = {}
complementk = {}
kcomplements = collections.Counter()

nphrases = 0
ncomplements = 0

for c in clause_verb:
    for p in L.d('phrase', c):
        nphrases += 1
        pf = pf_corr.get(p, F.function.v(p))
        if pf not in complfuncs: continue
        ncomplements += 1
        words = L.d('word', p)
        lexemes = [F.lex.v(w) for w in words]
        lexeme_set = set(lexemes)

        # measuring locativity
        lex_locativity = len(locative_lexemes & lexeme_set)
        prep_b = len([x for x in lexeme_set if x == 'B'])
        topo = len([x for x in words if F.nametype.v(x) == 'topo'])
        h_loc = len([x for x in words if F.uvf.v(x) == 'H'])
        body_part = 0
        if len(words) > 1 and F.lex.v(words[0]) == 'L' and F.lex.v(words[1]) in body_parts:
            body_part = 2
        loca = lex_locativity + topo + prep_b + h_loc + body_part

        # measuring indirect object
        prep_l = len([x for x in words if F.lex.v(x) in cmpl_as_iobj_preps and F.prs.v(x) not in no_prs])
        prep_lpr = 0
        lwn = len(words)
        for (n, wn) in enumerate(words):
            if F.lex.v(wn) in cmpl_as_iobj_preps:
                if n+1 < lwn:
                    nextw = words[n+1]
                    if F.lex.v(nextw) in personal_lexemes or F.ls.v(nextw) == 'gntl' or (
                        F.sp.v(nextw) == 'nmpr' and F.nametype.v(nextw) == 'pers'):
                        prep_lpr += 1                        
        indi = prep_l + prep_lpr

        # the verdict
        ckind = 'C'
        if loca == 0 and indi > 0: ckind = 'I'
        elif loca > 0 and indi == 0: ckind = 'L'
        elif loca > indi + 1: ckind = 'L'
        elif loca < indi - 1: ckind = 'I'
        complementk[p] = (loca, indi, ckind)
        kcomplements[ckind] += 1
        complements_c[c][ckind].append(p)
        complements[p] = (pf, ckind)

inf('Done')
for (label, n) in sorted(kcomplements.items(), key=lambda y: -y[1]):
    inf('Phrases of kind {:<2}: {:>6}'.format(label, n), withtime=False)
inf('Total complements : {:>6}'.format(ncomplements), withtime=False)
inf('Total phrases     : {:>6}'.format(nphrases), withtime=False)

  8.18s Determinig kind of complements
  8.68s Done
Phrases of kind L :   4448
Phrases of kind C :   4219
Phrases of kind I :   1709
Total complements :  10376
Total phrases     :  51953


In [25]:
def has_L(vl, pn):
    words = L.d('word', pn)
    return len(words) > 0 and F.lex.v(words[0] == 'L')

def is_lex_personal(vl, pn):
    words = L.d('word', pn)
    return len(words) > 1 and F.lex.v(words[1] in personal_lexemes)

def is_lex_local(vl, pn):
    words = L.d('word', pn)
    return len({F.lex.v(w) for w in words} & locative_lexemes) > 0

def has_H_locale(vl, pn):
    words = L.d('word', pn)
    return len({w for w in words if F.uvf.v(w) == 'H'}) > 0  

### Generic logic

This is the function that applies the generic rules about (in)direct objects and locatives.
It takes a phrase node and a set of new label values, and modifies those values.

In [26]:
grule_as_str = {
    'pdos':   '''direct_object => principal_direct_object''',
    'pdos-x': '''non-object => principal_direct_object''',
    'ndos':   '''direct_object => NP_direct_object''',
    'ndos-x': '''non-object => NP_direct_object''',
    'dos':    '''non-object => direct_object''',
    'ldos':   '''non-object => L_object''',
    'kdos':   '''non-object => K_object''',
    'inds-c': '''complement => indirect_object''',
    'locs-c': '''complement => location''',
    'inds-p': '''predicate complement => indirect_object''',
    'locs-p': '''predicate complement => location''',
    'cdos':   '''direct-object =(superfluously)=> direct object (clause)''',
    'cdos-x': '''non-object => direct object (clause)''',
    'idos':   '''infinitive_object =(superfluously)=> infinitive_object (clause)''',
    'idos-x': '''infinitive clause => infinitive_object''',
}

def rule_as_str_g(x, i): return '{}-{}'.format(i, grule_as_str[i])

rule_as_str = dict(
    generic=rule_as_str_g,
)

def generic_logic_p(pn, values):
    gl = None
    if pn in objects['principal']:
        oldv = values['grammatical']
        if oldv == 'direct_object':
            gl = 'pdos'
        else:
            gl = 'pdos-x'
            values['original'] = oldv
        values['grammatical'] = 'principal_direct_object'
    elif pn in objects['NP']:
        oldv = values['grammatical']
        if oldv == 'direct_object':
            gl = 'ndos'
        else:
            gl = 'ndos-x'
            values['original'] = oldv
        values['grammatical'] = 'NP_direct_object'
    elif pn in objects['direct']:
        oldv = values['grammatical']
        if oldv != 'direct_object':
            gl = 'dos'
            values['original'] = oldv
            values['grammatical'] = 'direct_object'
    elif pn in objects['L']:
        oldv = values['grammatical']
        gl = 'ldos'
        values['original'] = oldv
        values['grammatical'] = 'L_object'
    elif pn in objects['K']:
        oldv = values['grammatical']
        gl = 'kdos'
        values['original'] = oldv
        values['grammatical'] = 'K_object'
    elif pn in complements:
        (pf, ck) = complements[pn]
        if ck in {'I', 'L'}:
            if pf == 'Cmpl':
                if ck == 'I':
                    values['grammatical'] = 'indirect_object'
                    gl = 'inds-c'
                else:
                    values['valence'] = 'adjunct'
                    values['lexical'] = 'location'
                    values['semantic'] = 'location'
                    gl = 'locs-c'
            elif pf == 'PreC':
                if ck == 'I':
                    values['grammatical'] = 'indirect_object'
                    gl = 'inds-p'
                else:
                    values['lexical'] = 'location'
                    values['semantic'] = 'location'
                    gl = 'locs-p'
    return gl

def generic_logic_c(cn, values):
    gl = None
    if cn in objects['clause']:
        oldv = values['grammatical']
        if oldv == 'direct_object':
            gl = 'cdos'
        else:
            gl = 'cdos-x'
            values['original'] = oldv
        values['grammatical'] = 'direct_object'
    elif cn in objects['infinitive']:
        oldv = values['grammatical']
        if oldv == 'infinitive_object':
            gl = 'idos'
        else:
            gl = 'idos-x'
            values['original'] = oldv
        values['grammatical'] = 'infinitive_object'
    return gl

generic_logic = dict(
    phrase=generic_logic_p,
    clause=generic_logic_c,
)


### 5.1.1 Verb specific rules

The verb-specific enrichment rules are stored in a dictionary, keyed  by the verb lexeme.
The rule itself is a list of items.

The last item is a tuple of conditions that need to be fulfilled to apply the rule.

A condition can take the shape of

* a function, taking a phrase or clause node as argument and returning a boolean value
* an ETCBC feature for phrases or clauses : value, 
  which is true iff that feature has that value for the phrase or clause in question

In [27]:
enrich_logic = dict(
    phrase={
        'CJT': [
            (
                ('semantic', 'benefactive'), 
                ('function:Adju', has_L, is_lex_personal),
            ),
            (
                ('lexical', 'location'),
                ('function:Cmpl', has_H_locale),
            ),
            (
                ('lexical', 'location'),
                ('semantic', 'location'),
                ('function:Cmpl', is_lex_local),
            ),
        ],
    },
    clause={
    },
)

In [28]:
rule_index = collections.defaultdict(lambda: [])

def rule_as_str_s(vl, i):
    (conditions, sfassignments) = rule_index[vl][i]
    label = '{}-{}\n'.format(vl, i+1)
    rule = '\tIF   {}'.format('\n\tAND  '.join(
        '{:<10} = {:<8}'.format(
                *c.split(':')
            ) if type(c) is str else '{:<15}'.format(
                c.__name__
            ) for c in conditions,
    ))
    ass = []
    for (i, sfa) in enumerate(sfassignments):
        ass.append('\t\t{:<10} => {:<15}\n'.format(*sfa))
    return '{}{}\n\tTHEN\n{}'.format(label, rule, ''.join(ass))

rule_as_str['specific'] = rule_as_str_s

def check_logic():
    errors = 0
    nrules = 0
    for kind in sorted(enrich_logic):
        for vl in sorted(enrich_logic[kind]):
            for items in enrich_logic[kind][vl]:
                rule_index[vl].append((items[-1], items[0:-1]))
            for (i, (conditions, sfassignments)) in enumerate(rule_index[vl]):
                inf(rule_as_str_s(vl, i), withtime=False)
                nrules += 1
                for (sf, sfval) in sfassignments:
                    if sf not in enrich_fields:
                        msg('{}: "{}" not a valid enrich field'.format(kind, sf), withtime=False)
                        errors += 1
                    elif sfval not in enrich_fields[sf]:
                        msg('{}: `{}`: "{}" not a valid enrich field value'.format(kind, sf, sfval), withtime=False)
                        errors += 1
                for c in conditions:
                    if type(c) == str:
                        x = c.split(':')
                        if len(x) != 2:
                            msg('{}: Wrong feature condition {}'.format(kind, c), withtime=False)
                            errors += 1
                        else:
                            (feat, val) = x
                            if feat not in legal_values:
                                msg('{}: Feature `{}` not in use'.format(kind, feat), withtime=False)
                                errors += 1
                            elif val not in legal_values[feat]:
                                msg('{}: Feature `{}`: not a valid value "{}"'.format(kind, feat, val), withtime=False)
                                errors += 1
    if errors:
        msg('There were {} errors in {} rules'.format(errors, nrules), withtime=False)
    else:
        inf('All {} rules OK'.format(nrules), withtime=False)

check_logic()

CJT-1
	IF   function   = Adju    
	AND  has_L          
	AND  is_lex_personal
	THEN
		semantic   => benefactive    

CJT-2
	IF   function   = Cmpl    
	AND  has_H_locale   
	THEN
		lexical    => location       

CJT-3
	IF   function   = Cmpl    
	AND  is_lex_local   
	THEN
		lexical    => location       
		semantic   => location       

All 3 rules OK


In [29]:
rule_cases = collections.defaultdict(lambda: collections.defaultdict(lambda: {}))

def apply_logic(kind, vl, n, init_values):
    values = deepcopy(init_values)
    gr = generic_logic[kind](n, values)
    if gr:
        rule_cases['generic'][kind].setdefault(('', gr), []).append(n)
    verb_rules = enrich_logic[kind].get(vl, [])
    for (i, items) in enumerate(verb_rules):
        conditions = items[-1]
        sfassignments = items[0:-1]

        ok = True
        for condition in conditions:
            if type(condition) is str:
                (feature, value) = condition.split(':')
                if feature == 'function' and kind == 'phrase':
                    fval = pf_corr.get(n, F.function.v(n))
                else:
                    fval = F.item[feature].v(n)
                this_ok =  fval == value
            else:
                this_ok = condition(vl, n)
            if not this_ok:
                ok = False
                break
        if ok:
            for (sf, sfval) in sfassignments:
                values[sf] = sfval
            rule_cases['specific'][kind].setdefault((vl, i), []).append(n)
    return tuple(values[sf] for sf in enrich_fields)

In [30]:
COMMON_FIELDS = '''
    cnode#
    vnode#
    onode#
    book
    chapter
    verse
    verb_lexeme
    verb_stem
    verb_occurrence
    text
    constituent
'''.strip().split()

PHRASE_FIELDS = '''
    type
    function
'''.strip().split()

CLAUSE_FIELDS = '''
    type
    rela
'''.strip().split()

field_names = COMMON_FIELDS + PHRASE_FIELDS + CLAUSE_FIELDS + list(enrich_fields) 
pfillrows = len(CLAUSE_FIELDS)
cfillrows = len(PHRASE_FIELDS)
fillrows =  pfillrows + cfillrows
inf('\n'.join(field_names), withtime=False)    

cnode#
vnode#
onode#
book
chapter
verse
verb_lexeme
verb_stem
verb_occurrence
text
constituent
type
function
type
rela
valence
predication
grammatical
original
lexical
semantic


In [31]:
seen = collections.defaultdict(collections.Counter)

def gen_sheet_enrich(verb):
    rows = []
    fieldsep = ';'
    clauses_seen = set()
    for wn in occs[verb]:
        cn = L.u('clause', wn)
        if cn in clauses_seen: continue
        clauses_seen.add(cn)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        book = T.book_name(bn, lang='en')
        chapter = F.chapter.v(vn)
        verse = F.verse.v(vn)
        ln = ln_base+(ln_tpl.format(T.book_name(bn, lang='la'), chapter, verse))+ln_tweak
        vl = F.lex.v(wn).rstrip('[=')
        vstem = F.vs.v(wn)
        vt = T.words([wn], fmt='ec').replace('\n', '')
        ct = T.words(L.d('word', cn), fmt='ec').replace('\n', '')
        
        common_fields = (cn, wn, -1, book, chapter, verse, vl, vstem, vt, ct, '')
        rows.append(common_fields + (('',)*fillrows))
        for pn in L.d('phrase', cn):
            seen['phrase'][pn] += 1
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            common_fields = (cn, wn, pn, book, chapter, verse, vl, vstem, '', pt, 'phrase')
            pty = F.typ.v(pn)
            pf = pf_corr.get(pn, F.function.v(pn))
            phrase_fields =\
                ('',)*pfillrows +\
                (pty, pf) +\
                apply_logic('phrase', vl, pn, transform['phrase'][pf])            
            rows.append(common_fields + phrase_fields)
        for scn in clause_objects[cn]:
            seen['clause'][scn] += 1
            sct = T.words(L.d('word', scn), fmt='ec').replace('\n', '')
            common_fields = (cn, wn, scn, book, chapter, verse, vl, vstem, '', sct, 'clause')
            scty = F.typ.v(scn)
            scr = F.rela.v(scn)
            clause_fields =\
                (scty, scr) +\
                ('',)*cfillrows +\
                apply_logic('clause', vl, scn, transform['clause'][scr if scr == 'Objc' else scty])       
            rows.append(common_fields + clause_fields)

    filename = vfile(verb, 'enrich_blank')
    row_file = open(filename, 'w')
    row_file.write('{}\n'.format(fieldsep.join(field_names)))
    for row in rows:
        row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    row_file.close()
    inf('Generated enrichment sheet for verb {} ({:>5} rows)'.format(verb, len(rows)), withtime=False)
    
for verb in verbs: gen_sheet_enrich(verb)

inf('Done')
for scope in rule_cases:
    totalscope = 0
    for kind in rule_cases[scope]:
        inf('{}-{} rules:'.format(scope, kind), withtime=False)
        totalkind = 0
        for rule_spec in rule_cases[scope][kind]:
            cases = rule_cases[scope][kind][rule_spec]
            n = len(cases)
            totalscope += n
            totalkind += n
            if scope == 'generic':
                inf('{:>4} x\n\t{}\n\t{}\n'.format(
                    n, rule_as_str[scope](*rule_spec), 
                    ', '.join(str(c) for c in cases[0:10]),
                ), withtime=False)
            else:                
                inf('{:>4} x\n\t{}\n\t{}\n'.format(
                    n, rule_as_str[scope](*rule_spec),
                    ', '.join(str(c) for c in cases[0:10]),
                ), withtime=False)
        inf('{:>6} {}-{} rule applications'.format(totalkind, scope, kind), withtime=False)
    inf('{:>6} {} rule applications'.format(totalscope, scope), withtime=False)

for kind in seen:
    stats = collections.Counter()
    for (node, times) in seen[kind].items(): stats[times] += 1
    for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
        inf('{:>6} {} seen {:<2} time(s)'.format(n, kind, times), withtime=False)
    inf('{:>6} {} seen in total'.format(len(seen[kind]), kind), withtime=False)

Generated enrichment sheet for verb BW> (11102 rows)
Generated enrichment sheet for verb QR> ( 3332 rows)
Generated enrichment sheet for verb NPL ( 1932 rows)
Generated enrichment sheet for verb BR> (  198 rows)
Generated enrichment sheet for verb JRD ( 1593 rows)
Generated enrichment sheet for verb JY> ( 4630 rows)
Generated enrichment sheet for verb <LH ( 3893 rows)
Generated enrichment sheet for verb NF> ( 2875 rows)
Generated enrichment sheet for verb <FH (11323 rows)
Generated enrichment sheet for verb CJT (  381 rows)
Generated enrichment sheet for verb HLK ( 5819 rows)
Generated enrichment sheet for verb NTN ( 9846 rows)
Generated enrichment sheet for verb NWS (  617 rows)
Generated enrichment sheet for verb <BR ( 2320 rows)
Generated enrichment sheet for verb PQD ( 1283 rows)
Generated enrichment sheet for verb CWB ( 4247 rows)
Generated enrichment sheet for verb FJM ( 2915 rows)
Generated enrichment sheet for verb SWR ( 1281 rows)
    15s Done
specific-phrase rules:
   1 x
	CJ

In [32]:
def showcase(n):
    otype = F.otype.v(n)
    att1 = pf_corr.get(n, F.function.v(n)) if otype == 'phrase' else F.rela.v(n)
    att2 = F.typ.v(n)
    inf('''{} ({}-{}) {}\n{}'''.format(
        otype, att1, att2,
        T.words(L.d('word', n), fmt='ec'), 
        T.text(
            book=F.book.v(L.u('book', n)), 
            chapter=int(F.chapter.v(L.u('chapter', n))),
            verse=int(F.verse.v(L.u('verse', n))), 
            fmt='ec', lang='la',
        ),
    ), withtime=False)

In [33]:
showcase(654844)
showcase(445014)
showcase(432952)

phrase (PreC-NP) PQWDJ HLWJ LM#PXTM 
Numeri 26:57	W>LH PQWDJ HLWJ LM#PXTM LGR#WN M#PXT HGR#NJ LQHT M#PXT HQHTJ LMRRJ M#PXT HMRRJ00

clause (Adju-InfC) L<#WT 
Deuteronomium 9:18	W>TNPL LPNJ JHWH KR>#NH >RB<JM JWM W>RB<JM LJLH LXM L> >KLTJ WMJM L> #TJTJ <L KL&XV>TKM >#R XV>TM L<#WT HR< B<JNJ JHWH LHK<JSW00

clause (Adju-InfC) BLKTK 
Exodus 4:21	WJ>MR JHWH >L&M#H BLKTK L#WB MYRJMH R>H KL&HMPTJM >#R&#MTJ BJDK W<#JTM LPNJ PR<H W>NJ >XZQ >T&LBW WL> J#LX >T&H<M00



In [34]:
def check_h(vl, show_results=False):
    hl = {}
    total = 0
    for w in F.otype.s('word'):
        if F.sp.v(w) != 'verb' or F.lex.v(w).rstrip('[=/') != vl: continue
        total += 1
        c = L.u('clause', w)
        ps = L.d('phrase', c)
        phs = {p for p in ps if len({w for w in L.d('word', p) if F.uvf.v(w) == 'H'}) > 0}
        for f in ('Cmpl', 'Adju', 'Loca'):
            phc = {p for p in ps if pf_corr.get(p, None) or (pf_corr.get(p, F.function.v(p))) == f}
            if len(phc & phs): hl.setdefault(f, set()).add(w)
    for f in hl:
        inf('Verb {}: {} occurrences. He locales in {} phrases: {}'.format(vl, total, f, len(hl[f])), withtime=False)
        if show_results: inf('\t{}'.format(', '.join(str(x) for x in hl[f])), withtime=False)
check_h('BW>', show_results=True)        

Verb BW>: 2570 occurrences. He locales in Cmpl phrases: 157
	26118, 26127, 146447, 187920, 197138, 272406, 95257, 184350, 398368, 289826, 201253, 24616, 78897, 401459, 100410, 32829, 100413, 198208, 5698, 200258, 100938, 24653, 141902, 112207, 186960, 24658, 196690, 28764, 34400, 298594, 248931, 132198, 162918, 12402, 5747, 146044, 396927, 153216, 134792, 151176, 188042, 97419, 426120, 257165, 136338, 21656, 162970, 200349, 214687, 24740, 257192, 158378, 100527, 25777, 160434, 214707, 4789, 4793, 272569, 139963, 90812, 249020, 38595, 113861, 138448, 8920, 282841, 19166, 20703, 26850, 43235, 145127, 8424, 8937, 170729, 397032, 254703, 154354, 200948, 426230, 176376, 79609, 165626, 206075, 208636, 27391, 269569, 106246, 157447, 26380, 149785, 170782, 211232, 126758, 26414, 27438, 246062, 109363, 172340, 249140, 398134, 64828, 26431, 16704, 4929, 168771, 154964, 132955, 393569, 47460, 157541, 47466, 100206, 37232, 269170, 23415, 410999, 23933, 24448, 78208, 133518, 25999, 191381, 12698, 1

It would be handy to generate an informational spreadsheet that shows all these cases.

## 5.1 Process the enrichments

We read the enrichments, perform some consistency checks, and produce an annotation package.
If the filled-in sheet does not exist, we take the blank sheet, with the default assignment of the new features.
If a phrase got conflicting features, because it occurs in sheets for multiple verbs, the values in the filled-in sheet take precedence over the values in the blank sheet. If both occur in a filled in sheet, a warning will be issued.

In [35]:
objects_seen = collections.defaultdict(collections.Counter)

def read_enrich(rootdir): # rootdir will not be used, data is computed from sheets
    of_enriched = {
        False: {}, # for enrichments found in blank sheets
        True: {}, # for enrichments found in filled sheets
    }
    repeated = {
        False: collections.defaultdict(list), # for blank sheets
        True: collections.defaultdict(list), # for filled sheets
    }
    wrong_value = {
        False: collections.defaultdict(list),
        True: collections.defaultdict(list),
    }

    non_match = collections.defaultdict(list)
    wrong_node = collections.defaultdict(list)

    results = []
    dev_results = [] # results that deviate from the filled sheet
    
    ERR_LIMIT = 10

    for verb in sorted(verbs):
        vresults = {
            False: {}, # for blank sheets
            True: {}, # for filled sheets
        }
        for check in (
            (False, 'blank'), 
            (True, 'filled'),
        ):
            is_filled = check[0]
            filename = vfile(verb, 'enrich_{}'.format(check[1]))
            if not os.path.exists(filename):
                msg('NO {} enrichments file {}'.format(check[1], filename))
                continue
            #inf('READING {} enrichments file {}'.format(check[1], filename))

            with open(filename) as fh:
                header = fh.__next__()
                for line in fh:
                    fields = line.rstrip().split(';')
                    on = int(fields[2])
                    if on < 0: continue
                    kind = fields[10]
                    objects_seen[kind][on] += 1
                    vvals = tuple(fields[-nef:])
                    for (f, v) in zip(enrich_fields, vvals):
                        if v != '' and v != 'NA' and v not in enrich_fields[f]:
                            wrong_value[is_filled][on].append((verb, f, v))
                    vresults[is_filled][on] = vvals
                    if on in of_enriched[is_filled]:
                        if on not in repeated[is_filled]:
                            repeated[is_filled][on] = [of_enriched[is_filled][on]]
                        repeated[is_filled][on].append((verb, vvals))
                    else:
                        of_enriched[is_filled][on] = (verb, vvals)
                    if F.otype.v(on) != kind: 
                        non_match[on].append((verb, kind))
            for on in sorted(vresults[True]):          # check whether the phrase ids are not mangled
                if on not in vresults[False]:
                    wrong_node[on].append(verb)
            for on in sorted(vresults[False]):      # now collect all results, give precedence to filled values
                if F.otype.v(on) == 'phrase':
                    f_corr = on in pf_corr  # manual correction in phrase function
                    f_good = pf_corr.get(on, F.function.v(on)) 
                    s_manual = on in vresults[True] and vresults[False][on] != vresults[True][on] # real change
                else:
                    f_corr = ''
                    f_good = ''
                    s_manual = ''
                these_results = vresults[True][on] if s_manual else vresults[False][on]
                if f_corr or s_manual:
                    dev_results.append((on,)+these_results+(f_good, f_corr, s_manual))
                results.append((on,)+these_results+(f_good, f_corr, s_manual))

    for check in (
        (False, 'blank'), 
        (True, 'filled'),
    ):
        if len(wrong_value[check[0]]): #illegal values in sheets
            wrongs = wrong_value[check[0]]
            for x in sorted(wrongs)[0:ERR_LIMIT]:
                px = T.words(L.d('word', x), fmt='ev')
                ref_node = L.u('clause', x) if F.otype.v(x) != 'clause' else x
                cx = T.words(L.d('word', ref_node), fmt='ev')
                passage = T.passage(x)
                msg('ERROR: {} Illegal value(s) in {}: {} = {} in {}:'.format(
                    passage, check[1], x, px, cx
                ), withtime=False)
                for (verb, f, v) in wrongs[x]:
                    msg('\t"{}" is an illegal value for "{}" in verb {}'.format(
                        v, f, verb,
                    ), withtime=False)
            ne = len(wrongs)
            if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
        else:
            inf('OK: The used {} enrichment sheets have legal values'.format(check[1]))

        nerrors = 0
        if len(repeated[check[0]]): # duplicates in sheets, check consistency
            repeats = repeated[check[0]]
            for x in sorted(repeats):
                overview = collections.defaultdict(list)
                for y in repeats[x]: overview[y[1]].append(y[0])
                px = T.words(L.d('word', x), fmt='ev')
                ref_node = L.u('clause', x) if F.otype.v(x) != 'clause' else x
                cx = T.words(L.d('word', ref_node), fmt='ev')
                passage = T.passage(x)
                if len(overview) > 1:
                    nerrors += 1
                    if nerrors < ERR_LIMIT:
                        msg('ERROR: {} Conflict in {}: {} = {} in {}:'.format(
                            passage, check[1], x, px, cx
                        ), withtime=False)
                        for vals in overview:
                            msg('\t{:<40} in verb(s) {}'.format(
                                ', '.join(vals),
                                ', '.join(overview[vals]),
                        ), withtime=False)
                elif False: # for debugging purposes
                #else:
                    nerrors += 1
                    if nerrors < ERR_LIMIT:
                        inf('{} Agreement in {} {} = {} in {}: {}'.format(
                            passage, check[1], x, px, cx, ','.join(list(overview.values())[0]),
                        ), withtime=False)
            ne = nerrors
            if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
        if nerrors == 0:
            inf('OK: The used {} enrichment sheets are consistent'.format(check[1]))

    if len(non_match):
        msg('ERROR: Enrichments have been applied to nodes with non-matching types:')
        for x in sorted(non_match)[0:ERR_LIMIT]:
            (verb, shouldbe) = non_match[x]
            px = T.words(L.d('word', x), fmt='ev')
            msg('{}: {} Node {} is not a {} but a {}'.format(
                verb, T.passage(x), x, shouldbe, F.otype.v(x),
            ), withtime=False)
        ne = len(non_phrase)
        if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
    else:
        inf('OK: all enriched nodes where phrase nodes')

    if len(wrong_node):
        msg('ERROR: Node in filled sheet did not occur in blank sheet:')
        for x in sorted(wrong_node)[0:ERR_LIMIT]:
            px = T.words(L.d('word', x), fmt='ev')
            msg('{}: {} node {}'.format(
                wrong_node[x], T.passage(x), x,
            ), withtime=False)
        ne = len(wrong_node)
        if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
    else:
        inf('OK: all enriched nodes occurred in the blank sheet')

    if len(dev_results):
        inf('OK: there are {} manual correction/enrichment annotations'.format(len(dev_results)))
        for r in dev_results[0:ERR_LIMIT]:
            (x, *vals, f_good, f_corr, s_manual) = r
            px = T.words(L.d('word', x), fmt='ev')
            cx = T.words(L.d('word', L.u('clause', x)), fmt='ev')
            inf('{:<30} {:>7} => {:<3} {:<3} {}\n\t{}\n\t\t{}'.format(
                'COR' if f_corr else '',
                'MAN' if s_manual else'',
                T.passage(x), x, ','.join(vals), px, cx
            ), withtime=False)
        ne = len(dev_results)
        if ne > ERR_LIMIT: inf('... AND {} ANNOTATIONS MORE'.format(ne - ERR_LIMIT), withtime=False)
    else:
        msg('WARNING: there are no manual correction/enrichment annotations')
    return results

corr = ExtraData(API)
corr.deliver_annots(
    'complements', 
    {'title': 'Verb complement enrichments', 'date': '2016-06'},
    [
        (None, 'complements', read_enrich, tuple(
            ('JanetDyk', 'ft', fname) for fname in list(enrich_fields.keys())+['function', 'f_correction', 's_manual']
        ))
    ],
)

stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

    17s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/oBR_etcbc4b.csv
    17s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/oFH_etcbc4b.csv
    17s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/oLH_etcbc4b.csv
    17s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/BRa_etcbc4b.csv
    17s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/BWa_etcbc4b.csv
    17s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/CJT_etcbc4b.csv
    17s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/CWB_etcbc4b.csv
    18s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/HLK_etcbc4b.csv
    18s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/JRD_etcbc4b.csv
    18s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/JYa_etcbc4b.csv
    18s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/NFa_

    18s OK: The used blank enrichment sheets have legal values
    18s OK: The used blank enrichment sheets are consistent
    18s OK: The used filled enrichment sheets have legal values
    18s OK: The used filled enrichment sheets are consistent
    18s OK: all enriched nodes where phrase nodes
    18s OK: all enriched nodes occurred in the blank sheet
    18s OK: there are 1878 manual correction/enrichment annotations
COR                                    => Genesis 2:18 605699 adjunct,NA,NA,,,
	L.OW 
		>E<:EFEH.&L.OW <;ZER K.:NEG:D.OW00

COR                                    => Genesis 3:7 605887 adjunct,NA,NA,,,
	L@HEm 
		WAJ.A<:AFW. L@HEm X:AGOROT00

COR                                    => Genesis 3:21 606057 adjunct,NA,NA,,,
	L:>@D@m W.L:>IC:T.OW 
		WAJ.A<AF J:HW@H >:ELOHIJm L:>@D@m W.L:>IC:T.OW K.@T:NOWT <OWR 
COR                                    => Genesis 6:14 606814 adjunct,NA,NA,,,
	L:k@ 
		<:AF;H L:k@ T.;BAT <:AY;J&GOPER 
COR                                    => Gen

# 6 Annox complements
We load the new and modified features into the LAF-Fabric API, in the process of which they will be compiled.

Note that we draw in the new annotations by specifying an *annox* called `complements` (the second argument of the `fabric.load` function).

Then we turn that data into LAF annotations. Every enrichment is stored in new features, 
with names specified above in ``enrich_fields``, 
with label `ft` and namespace `JanetDyk`.

In [36]:
API=fabric.load(source+version, 'complements', 'flow_corr', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        oid otype
        sp vs lex
        rela typ
        chapter verse
        etcbc4:ft.function JanetDyk:ft.function
        s_manual f_correction
    ''' + ' '.join(enrich_fields),
    '''
    '''),
    "prepare": prepare,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: etcbc4b: UP TO DATE
  0.00s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s BEGIN COMPILE a: complements
  0.00s DETAIL: load main: X. [node]  -> 
  1.16s DETAIL: load main: X. [e]  -> 
  2.87s DETAIL: load main: G.node_anchor_min
  2.92s DETAIL: load main: G.node_anchor_max
  2.96s DETAIL: load main: G.node_sort
  3.00s DETAIL: load main: G.node_sort_inv
  3.39s DETAIL: load main: G.edges_from
  3.45s DETAIL: load main: G.edges_to
  3.50s LOGFILE=/Users/dirk/laf/laf-fabric-data/etcbc4b/bin/A/complements/__log__compile__.txt
  3.50s PARSING ANNOTATION FILES
  3.52s INFO: parsing complements.xml
  6.90s INFO: END PARSING
         0 good   regions  and     0 faulty ones
         0 linked nodes    and     0 unlinked ones
         0 good   edges    and     0 faulty ones
     53450 good   annots   and     0 faulty ones
    481050 good   features and     0 faulty ones
     53450 distinct xml identifiers

  6

## Simple test
Take the first 10 phrases and retrieve the corrected and uncorrected function feature.
Note that the corrected function feature is only filled in, if it occurs in a clause in which a selected verb occurs.

In [37]:
for i in list(F.otype.s('phrase'))[0:10]: 
    print('{} - {} - {}'.format(
        F.function.v(i), 
        F.JanetDyk_ft_function.v(i),
        L.u('clause', i) in clause_verb,
    ))

Time - Time - True
Pred - Pred - True
Subj - Subj - True
Objc - Objc - True
Conj - None - False
Subj - None - False
Pred - None - False
PreC - None - False
Conj - None - False
Subj - None - False


## Results

We put all corrections and enrichments in a single csv file for checking.

In [38]:
f = open(all_results, 'w')
NALLFIELDS = 17
tpl = ('{};' * (NALLFIELDS - 1))+'{}\n'

inf('collecting constituents ...')
f.write(tpl.format(
    '-',
    '-',
    'passage',
    'verb(s) text',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',
    'clause text',
    'clause node',
))
f.write(tpl.format(
    'corrected',
    'enriched',
    'passage',
    '-',
    'object type',
    'clause rela',
    'clause type',
    'phrase function (old)',
    'phrase function (new)',
    'phrase type',
    'valence',
    'predication',
    'grammatical',
    'original',
    'lexical',
    'semantic',
    'object text',
    'object node',
))
i = 0
j = 0
c = 0
CHUNK_SIZE = 10000
for cn in sorted(clause_verb):
    c += 1
    vrbs = sorted(clause_verb[cn])
    f.write(tpl.format(
        '',
        '',
        T.passage(cn),
        ' '.join(F.lex.v(verb) for verb in vrbs),
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        T.words(L.d('word', cn), fmt='ec').replace('\n', ' '),
        cn,
    ))
    for pn in L.d('phrase', cn):
        i += 1
        j += 1
        if j == CHUNK_SIZE:
            j = 0
            inf('{:>6} constituents in {:>5} clauses ...'.format(i, c))
        f.write(tpl.format(
            'COR' if F.f_correction.v(pn) == 'True' else '',
            'MAN' if F.s_manual.v(pn) == 'True' else '',
            T.passage(pn),
            '',
            'phrase',
            '',
            '',
            F.etcbc4_ft_function.v(pn),
            F.JanetDyk_ft_function.v(pn),
            F.typ.v(pn),
            F.valence.v(pn),
            F.predication.v(pn),
            F.grammatical.v(pn),
            F.original.v(pn),
            F.lexical.v(pn),
            F.semantic.v(pn),
            T.words(L.d('word', pn), fmt='ec').replace('\n', ' '),
            pn,
        ))
    for scn in clause_objects[cn]:
        i += 1
        j += 1
        if j == CHUNK_SIZE:
            j = 0
            inf('{:>6} constituents in {:>5} clauses ...'.format(i, c))
        f.write(tpl.format(
            '',
            '',
            T.passage(scn),
            '',
            'clause',
            F.rela.v(scn),
            F.typ.v(scn),
            '',
            '',
            '',
            F.valence.v(scn),
            F.predication.v(scn),
            F.grammatical.v(scn),
            F.original.v(scn),
            F.lexical.v(scn),
            F.semantic.v(scn),
            T.words(L.d('word', scn), fmt='ec').replace('\n', ' '),
            scn,
        ))

f.close()
inf('{:>6} constituents in {:>5} clauses done'.format(i, c))

  1.64s collecting constituents ...
  2.49s  10000 constituents in  2823 clauses ...
  3.32s  20000 constituents in  5707 clauses ...
  4.10s  30000 constituents in  8713 clauses ...
  4.93s  40000 constituents in 11768 clauses ...
  5.75s  50000 constituents in 14914 clauses ...
  6.05s  53450 constituents in 15880 clauses done
