<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-small.png"/></a>
<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="right" src="images/laf-fabric-small.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="right" src="images/DANS-small.png"/></a>

# Complement corrections


# 0. Introduction

Joint work of Dirk Roorda and Janet Dyk.

In order to do
[flowchart analysis](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/flowchart.html)
on verbs, we need to correct some coding errors.

Because the flowchart assigns meanings to verbs depending on the number and nature of complements found in their context, it is important that the phrases in those clauses are labeled correctly, i.e. that the
[function](https://shebanq.ancient-data.org/shebanq/static/docs/featuredoc/features/comments/function.html)
feature of those phrases has the correct label.

# References

(Janet Dyk, Reinoud Oosting and Oliver Glanz, 2014) 
Analysing Valence Patterns in Biblical Hebrew: Theoretical Questions and Analytic Frameworks.
*J. of Northwest Semitic Languages, vol. 40 (2014), no. 1, pp. 43-62*.
[pdf abstract](http://academic.sun.ac.za/jnsl/Volumes/JNSL%2040%201%20abstracts%20and%20bookreview.pdf)
[pdf fulltext (author's copy with deviant page numbering)](https://shebanq.ancient-data.org/static/docs/methods/2014_Dyk_jnsl.pdf)

(Janet Dyk 2014)
Deportation or Forgiveness in Hosea 1.6? Verb Valence Patterns and Translation Proposals.
*The Bible Translator 2014, Vol. 65(3) 235–279*.
[pdf](http://tbt.sagepub.com/content/65/3/235.full.pdf?ijkey=VK2CEHvVrvSGA5B&keytype=finite)

(Janet Dyk 2014)
Traces of Valence Shift in Classical Hebrew.
In: *Discourse, Dialogue, and Debate in the Bible: Essays in Honour of Frank Polak*.
Ed. Athalya Brenner-Idan.
*Sheffield Phoenix Press, 48–65*.
[book behind pay-wall](http://www.sheffieldphoenix.com/showbook.asp?bkid=273)

# 1. Task
In this notebook we do the following tasks:

* generate correction sheets for selected verbs,
* process the set of filled-in correction sheets,
* generate sheets with computed new features (valence-related, based on the corrected values), to be edited manually,
* transform the set of filled-in enrichment sheets into an annotation package.

Between the first and second task, the sheets will have been filled in by Janet with corrections.
Between the third and the fourth task, the sheets will be inspected and improved by Janet.

The resulting annotation package offers the corrections as the value of a new feature, also called `function`, but now in the annotation space `JanetDyk` instead of `etcbc4`.
The results of the enrichment will be added as new features in that same annotation space.

## 1.1 Limitations
We restrict ourselves to verb occurrences where the verb is the nucleus of a phrase with function *predicate*. 
There are also verb occurrences in other kinds of phrases, and these can also have complements, but such cases are coded very differently in the database. See for example [Joshua 3:8](https://shebanq.ancient-data.org/hebrew/text?book=Josua&chapter=3&verse=8&version=4b&mr=m&qw=q&tp=txt_tb1&tr=hb&wget=v&qget=v&nget=v) (*and you command the priest carrying the ark* ...).

# 2. Implementation

Start the engines, and note the import of the `ExtraData` functionality from the `etcbc.extra` module.
This module can turn data with anchors into additional LAF annotations to the big ETCBC LAF resource.

In [1]:
import sys, os, collections
from copy import deepcopy

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.extra import ExtraData

fabric = LafFabric()

  0.00s This is LAF-Fabric 4.7.2
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [2]:
source = 'etcbc'
version = '4b'

We instruct the API to load data.
Note that we ask for the XML identifiers, because `ExtraData` needs them to stitch the corrections into the LAF XML.

In [3]:
API = fabric.load(source+version, 'lexicon', 'flow_corr', {
    "xmlids": {"node": True, "edge": False},
    "features": ('''
        oid otype
        sp vs lex uvf prs nametype ls
        function rela
        chapter verse
    ''','''
        mother
    '''),
    "prepare": prepare,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: etcbc4b: UP TO DATE
  0.00s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s DETAIL: COMPILING a: lexicon: UP TO DATE
  0.00s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54
  0.01s DETAIL: load main: G.node_anchor_min
  0.13s DETAIL: load main: G.node_anchor_max
  0.23s DETAIL: load main: G.node_sort
  0.36s DETAIL: load main: G.node_sort_inv
  0.87s DETAIL: load main: G.edges_from
  0.99s DETAIL: load main: G.edges_to
  1.12s DETAIL: load main: X. [node]  -> 
  2.49s DETAIL: load main: X. [node]  <- 
  3.37s DETAIL: load main: F.etcbc4_db_oid [node] 
  4.21s DETAIL: load main: F.etcbc4_db_otype [node] 
  5.35s DETAIL: load main: F.etcbc4_ft_function [node] 
  5.54s DETAIL: load main: F.etcbc4_ft_lex [node] 
  5.76s DETAIL: load main: F.etcbc4_ft_ls [node] 
  5.99s DETAIL: load main: F.etcbc4_ft_prs [node] 
  6.21s DETAIL: load main: F.etcbc4_ft_rela [node] 
  6.57s DETAIL: load main: F.etcb

## 2.1 Locations

In [42]:
ln_base = 'https://shebanq.ancient-data.org/hebrew/text'
ln_tpl = '?book={}&chapter={}&verse={}'
ln_tweak = '&version=4b&mr=m&qw=n&tp=txt_tb1&tr=hb&wget=x&qget=v&nget=x'

home_dir = os.path.expanduser('~').replace('\\', '/')
base_dir = '{}/Dropbox/SYNVAR'.format(home_dir)
result_dir = '{}/results'.format(base_dir)
all_results = '{}/all.csv'.format(result_dir)
kinds = ('corr_blank', 'corr_filled', 'enrich_blank', 'enrich_filled')
kdir = {}
for k in kinds:
    kd = '{}/{}'.format(base_dir, k)
    kdir[k] = kd
    if not os.path.exists(kd):
        os.makedirs(kd)
if not os.path.exists(result_dir):
    os.makedirs(result_dir)


def vfile(verb, kind):
    if kind not in kinds:
        msg('Unknown kind `{}`'.format(kind))
        return None
    return '{}/{}_{}{}.csv'.format(kdir[kind], verb.replace('>','a').replace('<', 'o'), source, version)
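
The file naming used by `vfile` can be tried in isolation. The following is a minimal sketch, not the notebook's own function: `sheet_filename` is a hypothetical name, and the directory is passed in directly instead of being looked up in `kdir`. It shows how the transliteration characters `>` (aleph) and `<` (ayin) are replaced by letters that are safe in file names.

```python
def sheet_filename(directory, verb, source='etcbc', version='4b'):
    # hypothetical stand-in for vfile(): same naming pattern, no kind check
    # '>' and '<' are awkward in file names, so they become 'a' and 'o'
    safe_verb = verb.replace('>', 'a').replace('<', 'o')
    return '{}/{}_{}{}.csv'.format(directory, safe_verb, source, version)


print(sheet_filename('corr_blank', '<FH'))
print(sheet_filename('corr_blank', 'BR>'))
```

The resulting names follow the same pattern as the file names in the log output further down, such as `oFH_etcbc4b.csv` and `BRa_etcbc4b.csv`.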

## 2.2 Domain
Here is the set of verbs that interest us.

In [43]:
verbs_initial = set('''
    CJT
    BR>
    QR>
'''.strip().split())

motion_verbs = set('''
    <BR
    <LH
    BW>
    CWB
    HLK
    JRD
    JY>
    NPL
    NWS
    SWR
'''.strip().split())

double_object_verbs = set('''
    NTN
    <FH
    FJM
'''.strip().split())

complex_qal_verbs = set('''
    NF>
    PQD
'''.strip().split())

verbs = verbs_initial | motion_verbs | double_object_verbs | complex_qal_verbs

## 2.3 Phrase function

We need to correct some values of the phrase function.
When we receive the corrections, we check whether they have legal values.
Here we look up the possible values.

In [44]:
predicate_functions = {
    'Pred', 'PreS', 'PreO', 'PreC', 'PtcO',
}

In [45]:
legal_values = dict(
    function={F.function.v(p) for p in F.otype.s('phrase')},
)

We need some extra values to indicate other coding errors.
After adding them, we generate a list of occurrences of the selected verbs, organized by the lexeme of the verb.

In [46]:
error_values = dict(
    function=dict(
        BoundErr='this phrase is part of another phrase and does not merit its own function value',
    ),
)

We add the error_values to the legal values.

In [47]:
for feature in set(legal_values.keys()) | set(error_values.keys()):
    ev = error_values.get(feature, {})
    if ev:
        lv = legal_values.setdefault(feature, set())
        lv |= set(ev.keys())
inf('{}'.format(legal_values))

21m 19s {'function': {'IntS', 'Adju', 'Supp', 'ModS', 'NCop', 'Modi', 'Intj', 'PreC', 'PtcO', 'Loca', 'Nega', 'Frnt', 'Ques', 'ExsS', 'NCoS', 'Pred', 'PreS', 'Objc', 'Exst', 'PrAd', 'PreO', 'PrcS', 'Subj', 'BoundErr', 'Cmpl', 'Conj', 'Rela', 'Time', 'EPPr', 'Voct'}}
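
The check of a correction against the legal values can be sketched without the LAF-Fabric API. This is only an illustration: the set of functions is a small hand-picked subset of the values listed above, and `check_correction` is a hypothetical helper, not part of the notebook.

```python
legal_functions = {'Adju', 'Cmpl', 'Objc', 'Pred', 'Subj'}  # illustrative subset only
error_functions = {'BoundErr'}                              # extra value for coding errors
accepted = legal_functions | error_functions


def check_correction(value):
    """Return the cleaned value, None for an empty cell, or raise for an illegal value."""
    value = value.strip()
    if value == '':
        return None          # nothing filled in: no correction
    if value not in accepted:
        raise ValueError('illegal phrase function: {!r}'.format(value))
    return value


print(check_correction(' Cmpl '))
print(check_correction('BoundErr'))
```

The notebook itself collects illegal values in a set and reports them at the end instead of raising; raising is used here only to keep the sketch short.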


In [48]:
inf('Finding occurrences ...')
occs = collections.defaultdict(list)   # dictionary of all verb occurrence nodes per verb lexeme
npoccs = collections.defaultdict(list) # same, but those not occurring in a "predicate"
clause_verb = collections.defaultdict(list)    # dictionary of all verb occurrence nodes per clause node
clause_verb_selected = collections.defaultdict(list) # idem but for the occurrences of selected verbs

nw = 0
nws = 0
for w in F.otype.s('word'):
    if F.sp.v(w) != 'verb': continue
    nw += 1
    lex = F.lex.v(w).rstrip('/=[')
    pf = F.function.v(L.u('phrase', w))
    if pf not in predicate_functions:
        npoccs[lex].append(w)
    occs[lex].append(w)
    cn = L.u('clause', w)
    clause_verb[cn].append(w)
    if lex in verbs:
        nws += 1
        clause_verb_selected[cn].append(w)

inf('Done')
inf('Total:    {:>6} verb occurrences in {} clauses'.format(nw, len(clause_verb)), withtime=False)
inf('Selected: {:>6} verb occurrences in {} clauses'.format(nws, len(clause_verb_selected)), withtime=False)

for verb in sorted(verbs):
    inf('{} {:>5} occurrences of which {:>4} outside a predicate phrase'.format(
        verb, 
        len(occs[verb]),
        len(npoccs[verb]),
    ))

21m 21s Finding occurrences ...
21m 25s Done
Total:     73679 verb occurrences in 70131 clauses
Selected:  16209 verb occurrences in 16053 clauses
21m 25s <BR   556 occurrences of which   33 outside a predicate phrase
21m 25s <FH  2629 occurrences of which   59 outside a predicate phrase
21m 25s <LH   890 occurrences of which   10 outside a predicate phrase
21m 25s BR>    54 occurrences of which    3 outside a predicate phrase
21m 25s BW>  2570 occurrences of which   27 outside a predicate phrase
21m 25s CJT    85 occurrences of which    1 outside a predicate phrase
21m 25s CWB  1056 occurrences of which   22 outside a predicate phrase
21m 25s FJM   609 occurrences of which    3 outside a predicate phrase
21m 25s HLK  1554 occurrences of which   30 outside a predicate phrase
21m 25s JRD   377 occurrences of which   16 outside a predicate phrase
21m 25s JY>  1069 occurrences of which   32 outside a predicate phrase
21m 25s NF>   656 occurrences of which   52 outside a predicate phrase
2

# 3. Blank sheet generation
Generate correction sheets.
They are CSV files. Every row corresponds to a verb occurrence.
The fields per row are the node number of the clause in which the verb occurs, the node number of the verb occurrence, a passage label (book, chapter, verse), a link to the passage in SHEBANQ, the text of the verb occurrence (in ETCBC transliteration, consonantal), the verbal stem, and then 4 columns for each phrase in the clause:

* phrase node number
* phrase text (ETCBC translit consonantal)
* original value of the `function` feature
* corrected value of the `function` feature (generated as empty)
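
The column layout above can be sketched with the standard `csv` module. This is a minimal illustration under stated assumptions: `header` and `sheet_row` are hypothetical helpers, and the node numbers and texts in the example row are made up, not taken from the database.

```python
import csv
import io

FIXED = ['clause#', 'word#', 'passage', 'link', 'verb', 'stem']


def header(max_phrases):
    # six fixed columns, then four columns per phrase slot
    names = list(FIXED)
    for i in range(1, max_phrases + 1):
        names.extend(['phr{}#'.format(i), 'phr{}_txt'.format(i),
                      'phr{}_function'.format(i), 'phr{}_corr'.format(i)])
    return names


def sheet_row(fixed, phrases):
    # phrases: (node, text, function) triples; the correction cell starts empty
    row = list(fixed)
    for (node, text, function) in phrases:
        row.extend([node, text, function, ''])
    return row


out = io.StringIO()
writer = csv.writer(out, delimiter=';', lineterminator='\n')
writer.writerow(header(2))
writer.writerow(sheet_row(
    [1001, 2002, 'Genesis 1:1', 'link', 'BR>', 'qal'],     # made-up node numbers
    [(3001, 'BR>', 'Pred'), (3002, '>LHJM', 'Subj')],      # made-up phrases
))
print(out.getvalue())
```

Using `csv.writer` instead of joining strings by hand takes care of quoting cells that happen to contain the separator.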

In [49]:
phrases_seen = collections.Counter()

def gen_sheet(verb):
    rows = []
    fieldsep = ';'
    field_names = '''
        clause#
        word#
        passage
        link
        verb
        stem
    '''.strip().split()
    max_phrases = 0
    clauses_seen = set()
    for wn in occs[verb]:
        cln = L.u('clause', wn)
        if cln in clauses_seen: continue
        clauses_seen.add(cln)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        ch = F.chapter.v(vn)
        vs = F.verse.v(vn)
        passage_label = T.passage(vn)
        ln = ln_base+(ln_tpl.format(T.book_name(bn, lang='la'), ch, vs))+ln_tweak
        lnx = '''"=HYPERLINK(""{}""; ""link"")"'''.format(ln)
        vt = T.words([wn], fmt='ec').replace('\n', '')
        vstem = F.vs.v(wn)
        np = '* ' if wn in npoccs[verb] else ''
        row = [cln, wn, passage_label, lnx, np+vt, vstem]
        phrases = L.d('phrase', cln)
        n_phrases = len(phrases)
        if n_phrases > max_phrases: max_phrases = n_phrases
        for pn in phrases:
            phrases_seen[pn] += 1
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            pf = F.function.v(pn)
            pnp = np if pf in predicate_functions else ''
            row.extend((pn, pnp+pt, pf, ''))
        rows.append(row)
    for i in range(max_phrases):
        field_names.extend('''
            phr{i}#
            phr{i}_txt
            phr{i}_function
            phr{i}_corr
        '''.format(i=i+1).strip().split())
    filename = vfile(verb, 'corr_blank')
    with open(filename, 'w') as row_file:
        row_file.write('{}\n'.format(fieldsep.join(field_names)))
        for row in rows:
            row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    inf('Generated correction sheet for verb {}'.format(filename))
    
for verb in verbs: gen_sheet(verb)
    
stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

21m 28s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/FJM_etcbc4b.csv
21m 28s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/JRD_etcbc4b.csv
21m 29s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/oFH_etcbc4b.csv
21m 29s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/NPL_etcbc4b.csv
21m 29s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/CJT_etcbc4b.csv
21m 29s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/oLH_etcbc4b.csv
21m 29s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/NWS_etcbc4b.csv
21m 29s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/QRa_etcbc4b.csv
21m 30s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/PQD_etcbc4b.csv
21m 30s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/SWR_etcbc4b.csv
21m 30s Generated co

# 4. Processing corrections
We read the filled-in correction sheets and extract the correction data from them.
We store the corrections in a dictionary keyed by the phrase node.
We check whether we get multiple corrections for the same phrase.
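
The extraction step can be sketched on a single row, without files or the LAF-Fabric API. This assumes that a filled-in sheet has six fixed columns followed by four columns per phrase and that no cell contains the separator `;`. The names `corrections_in_row` and `merge` are hypothetical, and the node numbers are made up.

```python
N_FIXED = 6       # clause#, word#, passage, link, verb, stem
PER_PHRASE = 4    # node, text, original function, corrected function


def corrections_in_row(fields):
    """Return (phrase node, corrected function) pairs found in one row."""
    found = []
    for start in range(N_FIXED, len(fields) - PER_PHRASE + 1, PER_PHRASE):
        node = fields[start]
        corr = fields[start + 3].strip()
        if node != '' and corr != '':
            found.append((int(node), corr))
    return found


def merge(corrections, new_pairs):
    """Store new corrections; return the pairs that hit an already corrected phrase."""
    repeated = []
    for (node, corr) in new_pairs:
        if node in corrections:
            repeated.append((node, corr))   # the first correction stays in place
        else:
            corrections[node] = corr
    return repeated


line = '1001;2002;Genesis 1:1;link;BR>;qal;3001;BR>;Pred;;3002;>LHJM;Subj;Objc'
pairs = corrections_in_row(line.split(';'))
corrections = {}
print(pairs, merge(corrections, pairs))
```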

In [50]:
phrases_seen = collections.Counter()
pf_corr = {}

def read_corr():
    function_values = legal_values['function']

    for verb in sorted(verbs):
        repeated = collections.defaultdict(list)
        non_phrase = set()
        illegal_fvalue = set()

        filename = vfile(verb, 'corr_filled')
        if not os.path.exists(filename):
            msg('NO file {}'.format(filename))
            continue
        else:
            inf('Processing {}'.format(filename))
        with open(filename) as f:
            header = next(f)
            for line in f:
                fields = line.rstrip().split(';')
                for i in range(1, len(fields)//4):
                    (pn, pc) = (fields[2+4*i], fields[2+4*i+3])
                    if pn != '':
                        pc = pc.strip()
                        pn = int(pn)
                        phrases_seen[pn] += 1
                        if pc != '':
                            good = True
                            for _ in [1]:
                                good = False
                                if pn in pf_corr:
                                    repeated[pn].append(pc)
                                    continue
                                if pc not in function_values:
                                    illegal_fvalue.add(pc)
                                    continue
                                if F.otype.v(pn) != 'phrase': 
                                    non_phrase.add(pn)
                                    continue
                                good = True
                            if good:
                                pf_corr[pn] = pc

        inf('{}: Found {:>5} corrections in {}'.format(verb, len(pf_corr), filename))
        if len(repeated):
            msg('ERROR: Some phrases have been corrected multiple times!')
            for x in sorted(repeated):
                msg('{:>6}: {}'.format(x, ', '.join(repeated[x])))
        else:
            inf('OK: Corrected phrases did not receive multiple corrections')
        if len(non_phrase):
            msg('ERROR: Corrections have been applied to non-phrase nodes: {}'.format(','.join(str(n) for n in sorted(non_phrase))))
        else:
            inf('OK: all corrected nodes where phrase nodes')
        if len(illegal_fvalue):
            msg('ERROR: Some corrections supply illegal values for phrase function!')
            msg('`{}`'.format('`, `'.join(illegal_fvalue)))
        else:
            inf('OK: all corrected values are legal')
    inf('Found {} corrections in the phrase function'.format(len(pf_corr)))
        
read_corr()

stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

21m 34s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oBR_etcbc4b.csv
21m 34s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oFH_etcbc4b.csv
21m 34s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oLH_etcbc4b.csv


21m 34s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/BRa_etcbc4b.csv
21m 34s BR>: Found     4 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/BRa_etcbc4b.csv
21m 34s OK: Corrected phrases did not receive multiple corrections
21m 34s OK: all corrected nodes where phrase nodes
21m 34s OK: all corrected values are legal
21m 34s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/BWa_etcbc4b.csv
21m 34s BW>: Found    59 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/BWa_etcbc4b.csv
21m 34s OK: Corrected phrases did not receive multiple corrections
21m 34s OK: all corrected nodes where phrase nodes
21m 34s OK: all corrected values are legal
21m 34s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/CJT_etcbc4b.csv
21m 34s CJT: Found    62 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/CJT_etcbc4b.csv
21m 34s OK: Corrected phrases did not receive multiple corrections
21m 34s OK: all corrected nodes where phrase nodes
21m 34s OK: all corrected values are legal
21m 34s Pr

21m 34s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/JRD_etcbc4b.csv
21m 34s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/JYa_etcbc4b.csv


21m 34s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/NFa_etcbc4b.csv
21m 34s NF>: Found   465 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/NFa_etcbc4b.csv
21m 34s OK: Corrected phrases did not receive multiple corrections
21m 34s OK: all corrected nodes where phrase nodes
21m 34s OK: all corrected values are legal


21m 34s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/NPL_etcbc4b.csv


21m 34s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/NTN_etcbc4b.csv
21m 34s NTN: Found   609 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/NTN_etcbc4b.csv
21m 34s OK: Corrected phrases did not receive multiple corrections
21m 34s OK: all corrected nodes where phrase nodes
21m 34s OK: all corrected values are legal
21m 34s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/NWS_etcbc4b.csv
21m 34s NWS: Found   620 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/NWS_etcbc4b.csv
21m 34s OK: Corrected phrases did not receive multiple corrections
21m 34s OK: all corrected nodes where phrase nodes
21m 34s OK: all corrected values are legal
21m 34s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/PQD_etcbc4b.csv
21m 34s PQD: Found   646 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/PQD_etcbc4b.csv
21m 34s OK: Corrected phrases did not receive multiple corrections
21m 34s OK: all corrected nodes where phrase nodes
21m 34s OK: all corrected values are legal
21m 34s Pr

# 5. Enrichment

We create blank sheets for new feature assignments, based on the corrected data.

In [51]:
enrich_field_spec = '''
valence
    adjunct
    complement
    core

predication
    NA
    regular
    copula

grammatical
    NA
    subject
    principal_direct_object
    direct_object
    indirect_object
    *

original
    NA
    subject
    principal_direct_object
    direct_object
    indirect_object
    *

lexical
    location
    time

semantic
    benefactive
    time
    location
    instrument
    manner
'''
enrich_fields = collections.OrderedDict()
cur_e = None
for line in enrich_field_spec.strip().split('\n'):
    if line.startswith(' '):
        enrich_fields.setdefault(cur_e, set()).add(line.strip())
    else:
        cur_e = line.strip()
if None in enrich_fields:
    msg('Invalid enrich field specification')
else:
    inf('Enrich field specification OK')
for ef in enrich_fields:
    inf('{} = {{{}}}'.format(ef, ', '.join(sorted(enrich_fields[ef]))), withtime=False)
nef = len(enrich_fields)

21m 39s Enrich field specification OK
valence = {adjunct, complement, core}
predication = {NA, copula, regular}
grammatical = {*, NA, direct_object, indirect_object, principal_direct_object, subject}
original = {*, NA, direct_object, indirect_object, principal_direct_object, subject}
lexical = {location, time}
semantic = {benefactive, instrument, location, manner, time}


In [52]:
enrich_baseline_rules = '''
Adju	Adjunct	adjunct	NA	NA			
Cmpl	Complement	complement	NA	*			
Conj	Conjunction	NA	NA	NA		NA	NA
EPPr	Enclitic personal pronoun	NA	copula	NA			
ExsS	Existence with subject suffix	core	copula	subject			
Exst	Existence	core	copula	NA			
Frnt	Fronted element	NA	NA	NA		NA	NA
Intj	Interjection	NA	NA	NA		NA	NA
IntS	Interjection with subject suffix	core	NA	subject			
Loca	Locative	adjunct	NA	NA		location	location
Modi	Modifier	NA	NA	NA		NA	NA
ModS	Modifier with subject suffix	core	NA	subject			
NCop	Negative copula	core	copula	NA			
NCoS	Negative copula with subject suffix	core	copula	subject			
Nega	Negation	NA	NA	NA		NA	NA
Objc	Object	complement	NA	direct_object			
PrAd	Predicative adjunct	adjunct	NA	NA			
PrcS	Predicate complement with subject suffix	core	regular	subject			
PreC	Predicate complement	core	regular	NA			
Pred	Predicate	core	regular	NA			
PreO	Predicate with object suffix	core	regular	direct_object			
PreS	Predicate with subject suffix	core	regular	subject			
PtcO	Participle with object suffix	core	regular	direct_object			
Ques	Question	NA	NA	NA		NA	NA
Rela	Relative	NA	NA	NA		NA	NA
Subj	Subject	core	NA	subject			
Supp	Supplementary constituent	adjunct	NA	NA			benefactive
Time	Time reference	adjunct	NA	NA		time	time
Unkn	Unknown	NA	NA	NA		NA	NA
Voct	Vocative	NA	NA	NA		NA	NA'''.strip().split('\n')

In [53]:
transform = {}
for line in enrich_baseline_rules:
    x = line.split('\t')
    if len(x) - 2 != nef:
        msg('Wrong number of fields ({} must be {}) in {}'.format(len(x), nef, line))
    transform[x[0]] = dict(zip(enrich_fields, x[2:]))
for e in error_values['function']:
    transform[e] = dict(zip(enrich_fields, ['']*nef))

errors = 0
good = 0
for f in transform:
    for e in enrich_fields:
        val = transform[f][e]
        if val != '' and val != 'NA' and val not in enrich_fields[e]:
            msg('Defaults for `{}`: wrong `{}` value: "{}"'.format(f, e, val))
            errors += 1
        else: good += 1
if errors:
    msg('There were {} errors ({} good)'.format(errors, good))
else:
    inf('Enrich baseline rules are OK ({} good)'.format(good))

21m 42s Enrich baseline rules are OK (186 good)


Let us prettyprint the baseline rules of enrichment for easier reference.

In [54]:
ltpl = '{:<8}: '+('{:<15}' * nef)
inf(ltpl.format('func', *enrich_fields), withtime=False)
for f in sorted(transform):
    sfs = transform[f]
    inf(ltpl.format(f, *[sfs[sf] for sf in enrich_fields]), withtime=False)

func    : valence        predication    grammatical    original       lexical        semantic       
Adju    : adjunct        NA             NA                                                          
BoundErr:                                                                                           
Cmpl    : complement     NA             *                                                           
Conj    : NA             NA             NA                            NA             NA             
EPPr    : NA             copula         NA                                                          
ExsS    : core           copula         subject                                                     
Exst    : core           copula         NA                                                          
Frnt    : NA             NA             NA                            NA             NA             
IntS    : core           NA             subject                                            

## 5.1 Enrichment logic

For certain verbs and certain conditions, we can automatically fill in some of the new features.
For example, if the verb is `CJT`, and if an adjunct phrase is personal, starting with `L`, we know that the semantic role is *benefactive*.
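
Such a rule could look like the following sketch. It is an illustration only: `semantic_default` is a hypothetical function, and how a phrase is recognized as *personal* is not specified here, so that decision is passed in as a plain flag.

```python
def semantic_default(verb, valence, first_lexeme, is_personal):
    """Suggest a value for the `semantic` feature, or '' if no rule applies."""
    # rule from the text: CJT + personal adjunct starting with L => benefactive
    if verb == 'CJT' and valence == 'adjunct' and first_lexeme == 'L' and is_personal:
        return 'benefactive'
    return ''


print(semantic_default('CJT', 'adjunct', 'L', True))
print(repr(semantic_default('CJT', 'adjunct', 'B', True)))
```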

We will also analyse the direct and indirect objects more precisely and implement heuristics to make a distinction between complements (locative) and indirect objects.

### Finding the direct objects

In the target clauses we will find the direct object(s).
If there is more than one, we will compute which is the principal one.
The others are secondary ones.
If there is only one direct object, we do not mark it as principal.

An object can be a phrase or a clause. 
We will not mark object clauses as principal direct objects: the rules stated later on only consider objects at the phrase level.

### Implied objects

There are many cases where there is a direct object without it being marked as such in the data.
Those are cases where there are no objective, unambiguous signals for a direct object.
We call them *implied objects*. Examples: 

* the relativum in relative clauses
* complements starting with MN (from) or L (to)

In the case of implied objects we have to guess.
Initially we assume that there are no implied objects.

Later, when we inspect individual cases, we can mark principal objects and implied objects manually
for those cases where these rules do not suffice.

### Finding the principal direct object

When there are multiple direct objects, we use the rules formulated by (Janet Dyk, Reinoud Oosting and Oliver Glanz, 2014) to determine which one is the principal one. The rules are stated below where we make some remarks about how we apply them to our data.

#### Interpretation

When looking for principal direct objects, we restrict ourselves to direct objects at the phrase level: either complete phrases or pronominal suffixes within phrases. The following rules express an order of preference. In a given clause, the direct object that comes out as preferred is selected as the principal direct object. We only apply these rules if there are at least two direct objects.
If there is only one direct object, it is not marked as principal.

#### Rule 1: pronominal suffixes > marked objects > unmarked objects

In a given clause, we collect all phrases with function ``PreO`` or ``PtcO``. 
If this collection is non-empty, we pick the one that is textually first (by rule 3 below) and stop applying rules.
Otherwise, we proceed as follows.

We collect all the phrases with function ``Objc``.
If this collection is empty, there will not be a principal object.
Otherwise, we split it into marked and unmarked object phrases.

An object phrase is *marked* if and only if it contains, somewhere, the object marker ``>T``.
If there are marked object phrases, we pick the one that is textually first (by rule 3 below) and stop applying rules.
Otherwise we proceed with the next rule.

#### Rule 2: determined phrases > undetermined phrases

We only arrive here if there are multiple ``Objc`` phrases, none of which is marked.
In this case, we take the textually first one (by rule 3) that has the value ``det`` for its feature ``det``, if there is one, and stop applying rules.
Otherwise we proceed with the next rule.

#### Rule 3: earlier phrases > later phrases (by textual order)

This rule is implicitly applied if one of the rules before yielded more than one candidate for the principal object. Furthermore, we arrive here if the previous rules have not selected any principal direct object, while we do have more than one ``Objc`` phrase.

In this case, we pick the textually first ``Objc`` phrase.
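
The three rules can be sketched on plain data, independent of the LAF-Fabric API. This is a simplified illustration, not the notebook's implementation: `Obj` and `principal_object` are hypothetical names, each object carries precomputed flags for "marked" and "determined", and promoted complements are left out.

```python
from collections import namedtuple

# position: textual order; function: phrase function;
# marked: contains the object marker; determined: feature det has value det
Obj = namedtuple('Obj', 'position function marked determined')


def principal_object(objects):
    if len(objects) < 2:
        return None                       # a single object is not marked as principal

    def first(candidates):                # rule 3: textual order decides
        return min(candidates, key=lambda o: o.position)

    suffixed = [o for o in objects if o.function in {'PreO', 'PtcO'}]
    if suffixed:                          # rule 1: pronominal suffixes come first
        return first(suffixed)
    plain = [o for o in objects if o.function == 'Objc']
    if not plain:
        return None
    marked = [o for o in plain if o.marked]
    if marked:                            # rule 1: marked before unmarked
        return first(marked)
    determined = [o for o in plain if o.determined]
    if determined:                        # rule 2: determined before undetermined
        return first(determined)
    return first(plain)                   # rule 3 as the fallback


objs = [Obj(1, 'Objc', False, True), Obj(2, 'Objc', True, False)]
print(principal_object(objs))
```

In this example the second object wins, because being marked (rule 1) outranks being determined (rule 2).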

### Complements as Objects

In some cases, a complement functions as an object, such as in [Genesis 21:13](https://shebanq.ancient-data.org/hebrew/text?nget=v&chapter=21&book=Genesis&qw=n&tp=txt_tb1&version=4b&mr=m) *I make him (into) a people*.

Candidates are those complements that satisfy all of the following conditions:

* the complement starts with the preposition ``L`` or ``K``,
* the ``L`` or ``K`` in question does not carry a pronominal suffix,
* the preposition is not followed by a body part.

We are not sure whether these conditions are sufficient to warrant a direct object in all cases.
So we generate the preliminary grammatical label ``_promoted_direct_object`` in these cases.

In the review phase of the enrichment sheets, these cases must be resolved by changing this label to ``direct_object`` or ``NA``.

A promoted object ranks high as a *principal* direct object.
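
The candidate test can be sketched on plain data as follows. This is an illustration under assumptions: `is_promotion_candidate` is a hypothetical name, a phrase is represented as a list of `(lexeme, prs)` pairs, the body part set is a three-item subset of the full list, and the lexeme and suffix values in the examples are only meant as plausible samples. The sketch follows the conditions as stated in the text, so it applies the body part test after both prepositions.

```python
CMPL_AS_OBJ_PREPS = {'K', 'L'}
NO_SUFFIX = {'absent', 'n/a'}           # prs values that mean: no pronominal suffix
BODY_PARTS = {'JD/', 'PNH/', 'R>C/'}    # small illustrative subset


def is_promotion_candidate(words):
    """words: (lexeme, prs) pairs of a complement phrase, in textual order."""
    if not words:
        return False
    (first_lex, first_prs) = words[0]
    if first_lex not in CMPL_AS_OBJ_PREPS:
        return False                    # must start with L or K
    if first_prs not in NO_SUFFIX:
        return False                    # the preposition must not carry a suffix
    if len(words) > 1 and words[1][0] in BODY_PARTS:
        return False                    # must not be followed by a body part
    return True


print(is_promotion_candidate([('L', 'absent'), ('GWJ/', 'absent')]))
print(is_promotion_candidate([('L', 'absent'), ('JD/', 'absent')]))
```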

In [55]:
objectfuncs = set('''
Objc PreO PtcO
'''.strip().split())

cmpl_as_obj_preps = set('''
K L
'''.strip().split())

no_prs = set('''
absent n/a
'''.strip().split())

In [56]:
body_parts = set('''
>NP/ >P/ >PSJM/ >YB</ >ZN/
<JN/ <NQ/ <RP/ <YM/ <YM==/
BHN/ BHWN/ BVN/
CD=/ CD===/ CKM/ CN/
DD/
GRGRT/ GRM/ GRWN/ GW/ GW=/ GWJH/ GWPH/ GXWN/
FPH/
JD/ JRK/ JRKH/
KRF/ KSL=/ KTP/
L</ LCN/ LCWN/ LXJ/
M<H/ MPRQT/ MTL<WT/ MTNJM/ MYX/
NBLH=/
P<M/ PGR/ PH/ PM/ PNH/ PT=/
QRSL/
R>C/ RGL/
XDH/ XLY/ XMC=/ XRY/
YW>R/
ZRW</
'''.strip().split())

In [57]:
inf('Finding direct objects and determining the principal one')
directobjects = set()
pdirectobjects = set()
mobjects = collections.Counter() # count how many clauses have m objects (for each m)
cmobjects = collections.Counter() # count how many clauses have m cast objects (for each m)

def is_marked(phr):
    # simple criterion for determining whether a direct object is marked:
    # does it have the object marker somewhere?
    words = L.d('word', phr)  # use the argument, not the global loop variable p
    has_et = False
    for w in words:
        if F.lex.v(w) == '>T':
            has_et = True
            break
    return has_et

for c in clause_verb:
    dobjects = {}
    dobjects_set = set()
    cast = set()
    
    for p in L.d('phrase', c):
        pf = pf_corr.get(p, F.function.v(p))  # NB we take the corrected value for phrase function if there is one
        if pf in objectfuncs:
            dobjects.setdefault('p_'+pf, set()).add(p)
            dobjects_set.add(p)
        elif pf == 'Cmpl':
            pwords = L.d('word', p)
            w1 = pwords[0]
            w1l = F.lex.v(w1)
            w2l = F.lex.v(pwords[1]) if len(pwords) > 1 else None
            if w1l in cmpl_as_obj_preps and F.prs.v(w1) in no_prs and not (w1l == 'L' and w2l in body_parts):
                cast.add(p)
                dobjects_set.add(p)
    ncast = len(cast)
    if ncast:
        cmobjects[ncast] += 1
        
    # find clause objects
    for ac in L.d('clause', L.u('sentence', c)):
        cr = F.rela.v(ac)
        if cr in {'Objc'} and list(C.mother.v(ac))[0] == c:
            dobjects.setdefault('c_'+cr, set()).add(ac)
            dobjects_set.add(ac)

    # order the objects in the natural ordering
    dobjects_order = sorted(dobjects_set, key=NK)
    nobjects = len(dobjects_order)
    mobjects[nobjects] += 1

    # compute the principal object
    principal_object = None

    for x in [1]:
        # just one object 
        if nobjects == 1:
            # we have chosen not to mark a principal object if there is only one object
            # the alternative is to mark it if it is a phrase. Uncomment the next 2 lines if you want this
            # theobject = list(dobjects_set)[0]
            # if F.otype.v(theobject) == 'phrase': principal_object = theobject
            break
        # rule 1: suffixes and promoted objects
        principal_candidates = dobjects.get('p_PreO', set()) | dobjects.get('p_PtcO', set()) | cast
        if len(principal_candidates) != 0:
            principal_object = sorted(principal_candidates, key=NK)[0]
            break
        principal_candidates = dobjects.get('p_Objc', set())
        if len(principal_candidates) != 0:
            if len(principal_candidates) == 1:
                principal_object = list(principal_candidates)[0]
                break
            objects_marked = set()
            objects_unmarked = set()
            for p in principal_candidates:
                if is_marked(p):
                    objects_marked.add(p)
                else:
                    objects_unmarked.add(p)
            if len(objects_marked) != 0:
                principal_object = sorted(objects_marked, key=NK)[0]
                break
            if len(objects_unmarked) != 0:
                principal_object = sorted(objects_unmarked, key=NK)[0]
                break            
    if principal_object is not None:
        pdirectobjects.add(principal_object)

    if len(dobjects_set):
        directobjects |= dobjects_set

inf('Done') 

for (label, n) in sorted(mobjects.items(), key=lambda y: -y[0]):
    inf('{:<40}: {:>5}'.format('Clauses with {:>2} objects'.format(label), n))
for (label, n) in sorted(cmobjects.items(), key=lambda y: -y[0]):
    inf('{:<40}: {:>5}'.format('Clauses with {:>2} complements as objects'.format(label), n))

inf('{:<40}: {:>5}'.format('Clauses with a principal object', len(pdirectobjects)), withtime=False)
inf('{:<40}: {:>5}'.format('Clauses with a direct object', len(directobjects)), withtime=False)
inf('{:<40}: {:>5}'.format('Clauses with a cast direct object', sum(cmobjects.values())), withtime=False)
inf('{:<40}: {:>5}'.format('Total number of clauses', len(clause_verb)), withtime=False)

21m 52s Finding direct objects and determining the principal one
21m 53s Done
21m 53s Clauses with  3 objects                 :    75
21m 53s Clauses with  2 objects                 :  2482
21m 53s Clauses with  1 objects                 : 27238
21m 53s Clauses with  0 objects                 : 40336
21m 53s Clauses with  2 complements as objects  :    34
21m 53s Clauses with  1 complements as objects  :  3932
Clauses with a principal object         :  2548
Clauses with a direct object            : 32427
Clauses with a cast direct object       :  3966
Total number of clauses                 : 70131


### Finding indirect objects

The ETCBC database has no feature that marks indirect objects.
We will use computation to determine whether a complement is an indirect object or a locative.
This computation is just an approximation.

#### Cues for a locative complement

* ``# loc lexemes`` how many distinct lexemes with a locative meaning occur in the complement (given by a fixed list)
* ``# topo`` how many lexemes with nametype = ``topo`` occur in the complement (nametype is a feature of the lexicon)
* ``# prep_b`` 1 if the preposition ``B`` occurs in the complement, else 0 (the code counts it among the distinct lexemes of the complement)
* ``# h_loc`` how many H-locales are carried on words in the complement
* ``body_part`` is 2 if the phrase starts with the preposition ``L`` followed by a body part, else 0
* ``locativity`` ($loc$) a crude measure of the locativity of the complement, just the sum of ``# loc lexemes``, ``# topo``, ``# prep_b``, ``# h_loc`` and ``body_part``.

#### Cues for an indirect object

* ``# prep_l`` how many occurrences of the preposition ``L`` or ``>L`` with a pronominal suffix on it occur in the complement
* ``# L prop`` how many occurrences of ``L`` or ``>L`` plus proper name or person reference word occur in the complement
* ``indirect object`` ($ind$) a crude indicator of whether the complement is an indirect object, just the sum of ``# prep_l`` and ``# L prop`` 

#### The decision

We take a decision as follows.
The outcome is $L$ (complement is *locative*), $I$ (complement is *indirect object*), or $C$ (complement is neither *locative* nor *indirect object*).

(1) $loc > 0 \wedge ind = 0 \Rightarrow L$

(2) $loc = 0 \wedge ind > 0 \Rightarrow I$

(3) $loc > 0 \wedge ind > 0 \wedge loc - 1 > ind \Rightarrow L$

(4) $loc > 0 \wedge ind > 0 \wedge loc + 1 < ind \Rightarrow I$

(5) $loc > 0 \wedge ind > 0 \wedge |ind - loc| \leq 1 \Rightarrow C$

In words:

* if there are positive signals for L or I and none for the other, we choose the one for which there are positive signals;
* if there are positive signals for both L and I, we follow the majority count, but only if the difference is at least two;
* in all other cases we leave it at C: not necessarily locative and not necessarily indirect object.
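
The decision rules can be written as a small stand-alone function. This is a minimal sketch: the name `complement_kind` is ours and does not occur in the notebook, and the scores passed in are made-up examples rather than values from the database.

```python
def complement_kind(loc, ind):
    """Return 'L', 'I' or 'C' for a locativity score and an indirect-object score."""
    if loc > 0 and ind == 0:
        return 'L'          # rule (1): only locative signals
    if loc == 0 and ind > 0:
        return 'I'          # rule (2): only indirect-object signals
    if loc - 1 > ind:
        return 'L'          # rule (3): locative wins by at least two
    if loc + 1 < ind:
        return 'I'          # rule (4): indirect object wins by at least two
    return 'C'              # rule (5) and the case without any signal

# made-up score pairs, one per rule, plus the case without signals
examples = [(2, 0), (0, 1), (3, 1), (1, 3), (2, 1), (0, 0)]
kinds = [complement_kind(loc, ind) for (loc, ind) in examples]
```

Note that a complement without any signal (both scores zero) also ends up as C, which is what the last bullet above states.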

In [58]:
complfuncs = set('''
Cmpl PreC
'''.strip().split())

cmpl_as_iobj_preps = set('''
L >L
'''.strip().split())

In [59]:
locative_lexemes = set('''
>RY/ >YL/
<BR/ <BRH/ <BWR/ <C==/ <JR/ <L=/ <LJ=/ <LJH/ <LJL/ <MD=/ <MDH/ <MH/ <MQ/ <MQ===/ <QB/
BJT/
CM CMJM/ CMC/ C<R/
DRK/
FDH/
HR/
JM/ JRDN/ JRWCLM/ JFR>L/
MDBR/ MW<D/ MWL/ MZBX/ MYRJM/ MQWM/ MR>CWT/ MSB/ MSBH/ MVH==/
QDM/
SBJB/
TJMN/ TXT/ TXWT/
YPWN/
'''.strip().split())

personal_lexemes = set('''
>B/ >CH/ >DM/ >DRGZR/ >DWN/ >JC/ >J=/ >KR/ >LJL/ >LMN=/ >LMNH/ >LMNJ/ >LWH/ >LWP/ >M/ 
>MH/ >MN==/ >MWN=/ >NC/ >NWC/ >PH/ >PRX/ >SJR/ >SJR=/ >SP/ >X/ >XCDRPN/
>XWH/ >XWT/
<BDH=/ <CWQ/ <D=/ <DH=/ <LMH/ <LWMJM/ <M/ <MD/ <MJT/ <QR=/ <R/ <WJL/ <WL/ <WL==/ <WLL/
<WLL=/ <YRH/
B<L/ B<LH/ BKJRH/ BKR/ BN/ BR/ BR===/ BT/ BTWLH/ BWQR/ BXRJM/ BXWN/ BXWR/
CD==/ CDH/ CGL/ CKN/ CLCJM/ CLJC=/ CMRH=/ CPXH/ CW<R/ CWRR/
DJG/ DWD/ DWDH/ DWG/ DWR/
F<JR=/ FB/ FHD/ FR/ FRH/ FRJD/ FVN/
GBJRH/ GBR/ GBR=/ GBRT/ GLB/ GNB/ GR/ GW==/ GWJ/ GZBR/
HDBR/ 
J<RH/ JBM/ JBMH/ JD<NJ/ JDDWT/ JLD/ JLDH/ JLJD/ JRJB/ JSWR/ JTWM/ JWYR/
JYRJM/ 
KCP=/ KHN/ KLH/ KMR/ KN<NJ=/ KNT/ KRM=/ KRWB/ KRWZ/
L>M/ LHQH/ LMD/ LXNH/
M<RMJM/ M>WRH/ MCBR/ MCJX/ MCM<T/ MCMR/ MCPXH/ MCQLT/ MD<=/ MD<T/ MG/
MJNQT/ MKR=/ ML>K/ MLK/ MLKH/ MLKT/ MLX=/ MLYR/ MMZR/ MNZRJM/ MPLYT/
MPY=/ MQHL/ MQY<H/ MR</ MR>/ MSGR=/ MT/ MWRH/ MYBH=/
N<R/ N<R=/ N<RH/ N<RWT/ N<WRJM/ NBJ>/ NBJ>H/ NCJN/ NFJ>/ NGJD/ NJN/ NKD/ 
NKR/ NPC/ NPJLJM/ NQD/ NSJK/ NTJN/ 
PLGC/ PLJL/ PLJV/ PLJV=/ PQJD/ PR<H/ PRC/ PRJY/ PRJY=/ PRTMJM/ PRZWN/ 
PSJL/ PSL/ PVR/ PVRH/ PXH/ PXR/
QBYH/ QCRJM/ QCT=/ QHL/ QHLH/ QHLT/ QJM/ QYJN/
R<H=/ R<H==/ R<JH/ R<=/ R<WT/ R>H/ RB</ RB=/ RB==/ RBRBNJN/ RGMH/ RHB/ RKB=/
RKJL/ RMH/ RQX==/ 
SBL/ SPR=/ SRJS/ SRK/ SRNJM/ 
T<RWBWT/ TLMJD/ TLT=/ TPTJ/ TR<=/ TRCT>/ TRTN/ TWCB/ TWL<H/ TWLDWT/ TWTX/
VBX/ VBX=/ VBXH=/ VPSR/ VPXJM/
WLD/
XBL==/ XBL======/ XBR/ XBR=/ XBR==/ XBRH/ XBRT=/ XJ=/ XLC/ XM=/ XMWT/
XMWY=/ XNJK/ XR=/ XRC/ XRC====/ XRP=/ XRVM/ XTN/ XTP/ XZH=/
Y<JRH/ Y>Y>JM/ YJ/ YJD==/ YJR==/ YR=/ YRH=/ 
ZKWR/ ZMR=/ ZR</
'''.strip().split())

In [60]:
inf('Determinig kind of complements')

complements_c = collections.defaultdict(lambda: collections.defaultdict(lambda: []))
complements = {}
complementk = {}
kcomplements = collections.Counter()

nphrases = 0
ncomplements = 0

for c in clause_verb:
    for p in L.d('phrase', c):
        nphrases += 1
        pf = pf_corr.get(p, F.function.v(p))
        if pf not in complfuncs: continue
        ncomplements += 1
        words = L.d('word', p)
        lexemes = [F.lex.v(w) for w in words]
        lexeme_set = set(lexemes)

        # measuring locativity
        lex_locativity = len(locative_lexemes & lexeme_set)
        prep_b = len([x for x in lexeme_set if x == 'B'])
        topo = len([x for x in words if F.nametype.v(x) == 'topo'])
        h_loc = len([x for x in words if F.uvf.v(x) == 'H'])
        body_part = 0
        if len(words) > 1 and F.lex.v(words[0]) == 'L' and F.lex.v(words[1]) in body_parts:
            body_part = 2
        loca = lex_locativity + topo + prep_b + h_loc + body_part

        # measuring indirect object
        prep_l = len([x for x in words if F.lex.v(x) in cmpl_as_iobj_preps and F.prs.v(x) not in no_prs])
        prep_lpr = 0
        lwn = len(words)
        for (n, wn) in enumerate(words):
            if F.lex.v(wn) in cmpl_as_iobj_preps:
                if n+1 < lwn:
                    nextw = words[n+1]
                    if F.lex.v(nextw) in personal_lexemes or F.ls.v(nextw) == 'gntl' or (
                        F.sp.v(nextw) == 'nmpr' and F.nametype.v(nextw) == 'pers'):
                        prep_lpr += 1                        
        indi = prep_l + prep_lpr

        # the verdict
        ckind = 'C'
        if loca == 0 and indi > 0: ckind = 'I'
        elif loca > 0 and indi == 0: ckind = 'L'
        elif loca > indi + 1: ckind = 'L'
        elif loca < indi - 1: ckind = 'I'
        complementk[p] = (loca, indi, ckind)
        kcomplements[ckind] += 1
        complements_c[c][ckind].append(p)
        complements[p] = (pf, ckind)

inf('Done')
for (label, n) in sorted(kcomplements.items(), key=lambda y: -y[1]):
    inf('Phrases of kind {:<2}: {:>6}'.format(label, n), withtime=False)
inf('Total complements : {:>6}'.format(ncomplements), withtime=False)
inf('Total phrases     : {:>6}'.format(nphrases), withtime=False)

21m 58s Determinig kind of complements
21m 59s Done
Phrases of kind C :  17188
Phrases of kind L :  12081
Phrases of kind I :   7812
Total complements :  37081
Total phrases     : 214555


In [61]:
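def has_L(vl, pn):
    # does the phrase start with the preposition L? (vl, the verb lexeme, is not used)
    words = L.d('word', pn)
    return len(words) > 0 and F.lex.v(words[0]) == 'L'

def is_lex_personal(vl, pn):
    # is the second word of the phrase one of the personal lexemes?
    words = L.d('word', pn)
    return len(words) > 1 and F.lex.v(words[1]) in personal_lexemes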

def is_lex_local(vl, pn):
    words = L.d('word', pn)
    return len({F.lex.v(w) for w in words} & locative_lexemes) > 0

def has_H_locale(vl, pn):
    words = L.d('word', pn)
    return len({w for w in words if F.uvf.v(w) == 'H'}) > 0  

### Generic logic

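This is the function that applies the generic rules about (in)direct objects and locatives.
It takes a phrase node and a dictionary of new label values, modifies those values in place, and returns the number of the generic rule that was applied (or ``None`` if no rule applied).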
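
The calling convention (modify a dictionary in place, report which rule fired) can be tried out in isolation. The following is a reduced sketch covering only the direct-object rules 1 to 3; the sets `toy_principal` and `toy_direct`, the node numbers, and the default values are hypothetical stand-ins for the notebook's data.

```python
from copy import deepcopy

# hypothetical stand-ins for the sets of (principal) direct objects
toy_principal = {101}
toy_direct = {101, 102}

def toy_generic_logic(pn, values):
    """Reduced version of the object rules: changes `values`, returns a rule number."""
    rule = None
    if pn in toy_principal:
        old = values['grammatical']
        if old == 'direct_object':
            rule = 1
        else:
            rule = 2
            values['original'] = old
        values['grammatical'] = 'principal_direct_object'
    elif pn in toy_direct:
        old = values['grammatical']
        if old != 'direct_object':
            rule = 3
            values['original'] = old
            values['grammatical'] = 'direct_object'
    return rule

defaults = {'grammatical': 'NA', 'original': ''}   # assumed default labels
values = deepcopy(defaults)                        # keep the defaults untouched
rule = toy_generic_logic(102, values)
```

Working on a deep copy means the default labels stay unchanged while `values` holds the promoted label and remembers the old one under `original`.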

In [62]:
grule_as_str = {
    1: '''direct_object => principal_direct_object''',
    2: '''non-object => principal_direct_object''',
    3: '''non-object => direct_object''',
    4: '''direct-object superfluously promoted to direct object''',
    5: '''complement => indirect_object''',
    6: '''complement => location''',
    7: '''predicate complement => indirect_object''',
    8: '''predicate complement => location''',
}

def generic_logic(pn, values):
    gl = None
    if pn in pdirectobjects:
        oldv = values['grammatical']
        if oldv == 'direct_object':
            gl = 1
        else:
            gl = 2
            values['original'] = oldv
        values['grammatical'] = 'principal_direct_object'
    elif pn in directobjects:
        oldv = values['grammatical']
        if oldv != 'direct_object':
            gl = 3
            values['original'] = oldv
            values['grammatical'] = 'direct_object'
    elif pn in complements:
        (pf, ck) = complements[pn]
        if ck in {'I', 'L'}:
            if pf == 'Cmpl':
                if ck == 'I':
                    values['grammatical'] = 'indirect_object'
                    gl = 5
                else:
                    values['valence'] = 'adjunct'
                    values['lexical'] = 'location'
                    values['semantic'] = 'location'
                    gl = 6
            elif pf == 'PreC':
                if ck == 'I':
                    values['grammatical'] = 'indirect_object'
                    gl = 7
                else:
                    values['lexical'] = 'location'
                    values['semantic'] = 'location'
                    gl = 8
    return gl

### 5.1.1 Verb specific rules

The verb-specific enrichment rules are stored in a dictionary, keyed by the verb lexeme.
Each verb has a list of rules, and each rule is a tuple of items.

The last item is a tuple of conditions that must all be fulfilled for the rule to apply.
The preceding items are (field, value) assignments that are made when the rule applies.

A condition can take the shape of

* a function, taking the verb lexeme and a phrase node as arguments and returning a boolean value
* a string ``feature:value`` naming an ETCBC feature for phrases, which is true iff that feature has that value for the phrase in question
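
The two shapes of a condition can be evaluated along the following lines. This is a minimal sketch with toy data: `toy_features`, `toy_starts_with_l`, `condition_holds` and `rule_applies` are hypothetical names, and the feature table replaces the real feature lookup. In the notebook's own code, condition functions are called with the verb lexeme and the phrase node, which the sketch mimics.

```python
# toy feature table: phrase node -> {feature: value}; purely illustrative
toy_features = {
    1: {'function': 'Cmpl'},
    2: {'function': 'Adju'},
}

def toy_starts_with_l(verb_lexeme, phrase):
    # hypothetical stand-in for a test such as has_L
    return phrase == 2

def condition_holds(condition, verb_lexeme, phrase):
    """Evaluate one condition: a 'feature:value' string or a function."""
    if isinstance(condition, str):
        feature, value = condition.split(':')
        return toy_features[phrase].get(feature) == value
    return condition(verb_lexeme, phrase)

def rule_applies(conditions, verb_lexeme, phrase):
    # a rule applies only if all of its conditions hold
    return all(condition_holds(c, verb_lexeme, phrase) for c in conditions)

conditions = ('function:Adju', toy_starts_with_l)
matches = [p for p in sorted(toy_features) if rule_applies(conditions, 'CJT', p)]
```

Here only phrase 2 satisfies both conditions, so `matches` contains just that node.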

In [63]:
enrich_logic = {
    'CJT': [
        (
            ('semantic', 'benefactive'), 
            ('function:Adju', has_L, is_lex_personal),
        ),
        (
            ('lexical', 'location'),
            ('function:Cmpl', has_H_locale),
        ),
        (
            ('lexical', 'location'),
            ('semantic', 'location'),
            ('function:Cmpl', is_lex_local),
        ),
    ],    
}

In [64]:
rule_index = collections.defaultdict(lambda: [])

def rule_as_str(vl, i):
    (conditions, sfassignments) = rule_index[vl][i]
    result = '{}-{} '.format(vl, i+1)
    pref_len = len(result)
    result += 'if {}\n'.format(' AND '.join(
        '{:<10} = {:<8}'.format(
                *c.split(':')
            ) if type(c) is str else '{:<15}'.format(
                c.__name__
            ) for c in conditions,
    ))
    for (i, sfa) in enumerate(sfassignments):
        result += '{}{:<10} => {:<15}\n'.format(' '* pref_len, *sfa)
    return result

def check_logic():
    errors = 0
    nrules = 0
    for vl in sorted(enrich_logic):
        for items in enrich_logic[vl]:
            rule_index[vl].append((items[-1], items[0:-1]))
        for (i, (conditions, sfassignments)) in enumerate(rule_index[vl]):
            inf(rule_as_str(vl, i))
            nrules += 1
            for (sf, sfval) in sfassignments:
                if sf not in enrich_fields:
                    msg('"{}" not a valid enrich field'.format(sf), withtime=False)
                    errors += 1
                elif sfval not in enrich_fields[sf]:
                    msg('`{}`: "{}" not a valid enrich field value'.format(sf, sfval), withtime=False)
                    errors += 1
            for c in conditions:
                if type(c) == str:
                    x = c.split(':')
                    if len(x) != 2:
                        msg('Wrong feature condition {}'.format(c))
                        errors += 1
                    else:
                        (feat, val) = x
                        if feat not in legal_values:
                            msg('Feature `{}` not in use'.format(feat))
                            errors += 1
                        elif val not in legal_values[feat]:
                            msg('Feature `{}`: not a valid value "{}"'.format(feat, val))
                            errors += 1
    if errors:
        msg('There were {} errors in {} rules'.format(errors, nrules))
    else:
        inf('All {} rules OK'.format(nrules))

check_logic()

22m 15s CJT-1 if function   = Adju     AND has_L           AND is_lex_personal
      semantic   => benefactive    

22m 15s CJT-2 if function   = Cmpl     AND has_H_locale   
      lexical    => location       

22m 15s CJT-3 if function   = Cmpl     AND is_lex_local   
      lexical    => location       
      semantic   => location       

22m 15s All 3 rules OK


In [65]:
generic_cases = {}
applied_cases = {}

def apply_logic(vl, pn, init_values):
    values = deepcopy(init_values)
    gr = generic_logic(pn, values)
    if gr:
        generic_cases.setdefault(gr, []).append(pn)
    verb_rules = enrich_logic.get(vl, [])
    for (i, items) in enumerate(verb_rules):
        conditions = items[-1]
        sfassignments = items[0:-1]

        ok = True
        for condition in conditions:
            if type(condition) is str:
                (feature, value) = condition.split(':')
                fval = pf_corr.get(pn, F.function.v(pn)) if feature == 'function' else F.item[feature].v(pn)
                this_ok =  fval == value
            else:
                this_ok = condition(vl, pn)
            if not this_ok:
                ok = False
                break
        if ok:
            for (sf, sfval) in sfassignments:
                values[sf] = sfval
            applied_cases.setdefault((vl, i), []).append(pn)
    return tuple(values[sf] for sf in enrich_fields)

In [66]:
COMMON_FIELDS = '''
    cnode#
    vnode#
    pnode#
    book
    chapter
    verse
    link
    verb_lexeme
    verb_stem
    verb_occurrence
'''.strip().split()

CLAUSE_FIELDS = '''
    clause_text    
'''.strip().split()

PHRASE_FIELDS = '''
    phrase_text
    function
'''.strip().split() + list(enrich_fields)

field_names = []
for f in COMMON_FIELDS: field_names.append(f)
for i in range(max((len(CLAUSE_FIELDS), len(PHRASE_FIELDS)))):
    pf = PHRASE_FIELDS[i] if i < len(PHRASE_FIELDS) else '--'
    field_names.append(pf)
    
fillrows = len(CLAUSE_FIELDS) - len(PHRASE_FIELDS)
cfillrows = 0 if fillrows >= 0 else -fillrows
pfillrows = fillrows if fillrows >= 0 else 0
inf('\n'.join(field_names), withtime=False)    

cnode#
vnode#
pnode#
book
chapter
verse
link
verb_lexeme
verb_stem
verb_occurrence
phrase_text
function
valence
predication
grammatical
original
lexical
semantic


In [67]:
phrases_seen = collections.Counter()

def gen_sheet_enrich(verb):
    rows = []
    fieldsep = ';'
    clauses_seen = set()
    for wn in occs[verb]:
        cl = L.u('clause', wn)
        if cl in clauses_seen: continue
        clauses_seen.add(cl)
        cn = L.u('clause', wn)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        book = T.book_name(bn, lang='en')
        chapter = F.chapter.v(vn)
        verse = F.verse.v(vn)
        ln = ln_base+(ln_tpl.format(T.book_name(bn, lang='la'), chapter, verse))+ln_tweak
        lnx = '''"=HYPERLINK(""{}""; ""link"")"'''.format(ln)
        vl = F.lex.v(wn).rstrip('[=')
        vstem = F.vs.v(wn)
        vt = T.words([wn], fmt='ec').replace('\n', '')
        ct = T.words(L.d('word', cn), fmt='ec').replace('\n', '')
        
        common_fields = (cn, wn, -1, book, chapter, verse, lnx, vl, vstem, vt)
        clause_fields = (ct,)
        rows.append(common_fields + clause_fields + (('',)*cfillrows))
        for pn in L.d('phrase', cn):
            phrases_seen[pn] += 1
            common_fields = (cn, wn, pn, book, chapter, verse, lnx, vl, vstem, vt)
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            pf = pf_corr.get(pn, F.function.v(pn))
            phrase_fields = (pt, pf) + apply_logic(vl, pn, transform[pf])            
            rows.append(common_fields + phrase_fields + (('',)*pfillrows))
    filename = vfile(verb, 'enrich_blank')
    row_file = open(filename, 'w')
    row_file.write('{}\n'.format(fieldsep.join(field_names)))
    for row in rows:
        row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    row_file.close()
    inf('Generated enrichment sheet for verb {} ({} rows)'.format(filename, len(rows)))
    
for verb in verbs: gen_sheet_enrich(verb)

inf('Done')
inf('{} rules applied'.format(len(applied_cases)), withtime=False)
totaln = 0
for rule_id in generic_cases:
    cases = generic_cases[rule_id]
    n = len(cases)
    totaln += n
    inf('{}\n\t{:>4} phrases: {}\n'.format(
        grule_as_str[rule_id], n, ', '.join(str(c) for c in cases[0:10]),
    ), withtime=False)
inf('{} generic applications in total'.format(totaln), withtime=False)
totaln = 0
for rule_id in applied_cases:
    cases = applied_cases[rule_id]
    n = len(cases)
    totaln += n
    inf('{}\n\t{:>4} phrases: {}\n'.format(
        rule_as_str(*rule_id), n, ', '.join(str(c) for c in cases[0:10]),
    ), withtime=False)
inf('{} specific applications in total'.format(totaln), withtime=False)

stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

22m 20s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/FJM_etcbc4b.csv (2857 rows)
22m 20s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/JRD_etcbc4b.csv (1553 rows)
22m 21s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/oFH_etcbc4b.csv (11048 rows)
22m 21s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/NPL_etcbc4b.csv (1915 rows)
22m 21s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/CJT_etcbc4b.csv (375 rows)
22m 21s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/oLH_etcbc4b.csv (3821 rows)
22m 21s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/NWS_etcbc4b.csv (613 rows)
22m 21s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/QRa_etcbc4b.csv (3662 rows)
22m 22s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/PQD_etcbc4b.csv (1269

In [68]:
def showcase(pn):
    inf('''{} {}\n{}'''.format(
        pf_corr.get(pn, F.function.v(pn)),
        T.words(L.d('word', pn), fmt='ec'), 
        T.text(
            book=F.book.v(L.u('book', pn)), 
            chapter=int(F.chapter.v(L.u('chapter', pn))),
            verse=int(F.verse.v(L.u('verse', pn))), 
            fmt='ec', lang='la',
        ),
    ), withtime=False)

In [69]:
showcase(654844)

PreC PQWDJ HLWJ LM#PXTM 
Numeri 26:57	W>LH PQWDJ HLWJ LM#PXTM LGR#WN M#PXT HGR#NJ LQHT M#PXT HQHTJ LMRRJ M#PXT HMRRJ00



In [70]:
def check_h(vl, show_results=False):
    hl = {}
    total = 0
    for w in F.otype.s('word'):
        if F.sp.v(w) != 'verb' or F.lex.v(w).rstrip('[=/') != vl: continue
        total += 1
        c = L.u('clause', w)
        ps = L.d('phrase', c)
        phs = {p for p in ps if len({w for w in L.d('word', p) if F.uvf.v(w) == 'H'}) > 0}
        for f in ('Cmpl', 'Adju', 'Loca'):
            phc = {p for p in ps if pf_corr.get(p, F.function.v(p)) == f}
            if len(phc & phs): hl.setdefault(f, set()).add(w)
    for f in hl:
        inf('Verb {}: {} occurrences. He locales in {} phrases: {}'.format(vl, total, f, len(hl[f])), withtime=False)
        if show_results: inf('\t{}'.format(', '.join(str(x) for x in hl[f])), withtime=False)
check_h('BW>', show_results=True)        

Verb BW>: 2570 occurrences. He locales in Cmpl phrases: 157
	26118, 26127, 146447, 187920, 197138, 272406, 95257, 184350, 398368, 289826, 201253, 24616, 78897, 401459, 100410, 32829, 100413, 198208, 5698, 200258, 100938, 24653, 141902, 112207, 186960, 24658, 196690, 28764, 34400, 298594, 248931, 132198, 162918, 12402, 5747, 146044, 396927, 153216, 134792, 151176, 188042, 97419, 426120, 257165, 136338, 21656, 162970, 200349, 214687, 24740, 257192, 158378, 100527, 25777, 160434, 214707, 4789, 4793, 272569, 139963, 90812, 249020, 38595, 113861, 138448, 8920, 282841, 19166, 20703, 26850, 43235, 145127, 8424, 8937, 170729, 397032, 254703, 154354, 200948, 426230, 176376, 79609, 165626, 206075, 208636, 27391, 269569, 106246, 157447, 26380, 149785, 170782, 211232, 126758, 26414, 27438, 246062, 109363, 172340, 249140, 398134, 64828, 26431, 16704, 4929, 168771, 154964, 132955, 393569, 47460, 157541, 47466, 100206, 37232, 269170, 23415, 410999, 23933, 24448, 78208, 133518, 25999, 191381, 12698, 1

It would be handy to generate an informational spreadsheet that shows all these cases.
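
Such a sheet could be produced along these lines. This is only a sketch: `h_locale_rows` and `write_sheet` are hypothetical helpers, the column names are our choice, and the sample mapping merely mimics the shape of the `hl` dictionary in `check_h` (function label to set of verb word nodes), using two node numbers copied from the output above. It writes to an in-memory buffer so that it can be tried without touching the file system.

```python
import csv
import io

def h_locale_rows(verb, hits):
    """Yield one row per (function, word node) pair, in a stable order."""
    for function in sorted(hits):
        for word_node in sorted(hits[function]):
            yield (verb, function, word_node)

def write_sheet(rows, handle, fieldsep=';'):
    # same field separator as the enrichment sheets in this notebook
    writer = csv.writer(handle, delimiter=fieldsep, lineterminator='\n')
    writer.writerow(('verb_lexeme', 'function', 'vnode#'))
    writer.writerows(rows)

sample_hits = {'Cmpl': {26118, 26127}}   # shape of hl inside check_h
buffer = io.StringIO()
write_sheet(h_locale_rows('BW>', sample_hits), buffer)
sheet_text = buffer.getvalue()
```

To write a real file, the buffer can be replaced by a file handle opened with `newline=''`, as the `csv` module recommends.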

## 5.1 Process the enrichments

We read the enrichments, perform some consistency checks, and produce an annotation package.
If the filled-in sheet does not exist, we take the blank sheet, with the default assignment of the new features.
If a phrase gets conflicting features because it occurs in sheets for multiple verbs, the values in the filled-in sheet take precedence over the values in the blank sheet. If the conflicting values both occur in filled-in sheets, a warning is issued.
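
The precedence of filled-in values over blank values can be illustrated with plain dictionaries. A minimal sketch: `merge_sheets` is a hypothetical helper, and the phrase numbers and value tuples are toy data, not taken from any sheet.

```python
def merge_sheets(blank, filled):
    """For each phrase of the blank sheet, pick the filled-in values if they really differ.

    Returns phrase -> (values, manual), where manual tells whether the
    filled-in sheet changed the default values.
    """
    merged = {}
    for phrase, blank_values in blank.items():
        manual = phrase in filled and filled[phrase] != blank_values
        merged[phrase] = (filled[phrase] if manual else blank_values, manual)
    return merged

# toy sheets: phrase node -> tuple of enrichment values
toy_blank = {10: ('complement', 'NA'), 11: ('adjunct', 'NA'), 12: ('adjunct', 'NA')}
toy_filled = {10: ('complement', 'NA'), 11: ('complement', 'location')}
merged = merge_sheets(toy_blank, toy_filled)
```

Phrase 10 occurs in both sheets with identical values and therefore does not count as a manual change; phrase 11 does, and phrase 12 simply keeps its defaults.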

In [71]:
phrases_seen = collections.Counter()

def read_enrich(rootdir): # rootdir will not be used, data is computed from sheets
    pf_enriched = {
        False: {}, # for enrichments found in blank sheets
        True: {}, # for enrichments found in filled sheets
    }
    repeated = {
        False: collections.defaultdict(list), # for blank sheets
        True: collections.defaultdict(list), # for filled sheets
    }
    wrong_value = {
        False: collections.defaultdict(list),
        True: collections.defaultdict(list),
    }

    non_phrase = collections.defaultdict(list)
    wrong_node = collections.defaultdict(list)

    results = []
    dev_results = [] # results that deviate from the filled sheet
    
    ERR_LIMIT = 10

    for verb in sorted(verbs):
        vresults = {
            False: {}, # for blank sheets
            True: {}, # for filled sheets
        }
        for check in (
            (False, 'blank'), 
            (True, 'filled'),
        ):
            is_filled = check[0]
            filename = vfile(verb, 'enrich_{}'.format(check[1]))
            if not os.path.exists(filename):
                msg('NO {} enrichments file {}'.format(check[1], filename))
                continue
            #inf('READING {} enrichments file {}'.format(check[1], filename))

            with open(filename) as f:
                header = f.__next__()
                for line in f:
                    fields = line.rstrip().split(';')
                    pn = int(fields[2])
                    if pn < 0: continue
                    phrases_seen[pn] += 1
                    vvals = tuple(fields[-nef:])
                    for (ef, v) in zip(enrich_fields, vvals):  # ef: do not shadow the file handle f
                        if v != '' and v != 'NA' and v not in enrich_fields[ef]:
                            wrong_value[is_filled][pn].append((verb, ef, v))
                    vresults[is_filled][pn] = vvals
                    if pn in pf_enriched[is_filled]:
                        if pn not in repeated[is_filled]:
                            repeated[is_filled][pn] = [pf_enriched[is_filled][pn]]
                        repeated[is_filled][pn].append((verb, vvals))
                    else:
                        pf_enriched[is_filled][pn] = (verb, vvals)
                    if F.otype.v(pn) != 'phrase': 
                        non_phrase[pn].append(verb)
            for pn in sorted(vresults[True]):          # check whether the phrase ids are not mangled
                if pn not in vresults[False]:
                    wrong_node[pn].append(verb)
            for pn in sorted(vresults[False]):      # now collect all results, give precedence to filled values
                f_corr = pn in pf_corr                               # manual correction in phrase function
                f_good = pf_corr.get(pn, F.function.v(pn))
                s_manual = pn in vresults[True] and vresults[False][pn] != vresults[True][pn] # real change
                these_results = vresults[True][pn] if s_manual else vresults[False][pn]
                if f_corr or s_manual:
                    dev_results.append((pn,)+these_results+(f_good, f_corr, s_manual))
                results.append((pn,)+these_results+(f_good, f_corr, s_manual))

    for check in (
        (False, 'blank'), 
        (True, 'filled'),
    ):
        if len(wrong_value[check[0]]): #illegal values in sheets
            wrongs = wrong_value[check[0]]
            for x in sorted(wrongs)[0:ERR_LIMIT]:
                px = T.words(L.d('word', x), fmt='ev')
                cx = T.words(L.d('word', L.u('clause', x)), fmt='ev')
                passage = T.passage(x)
                msg('ERROR: {} Illegal value(s) in {}: {} = {} in {}:'.format(
                    passage, check[1], x, px, cx
                ), withtime=False)
                for (verb, f, v) in wrongs[x]:
                    msg('\t"{}" is an illegal value for "{}" in verb {}'.format(
                        v, f, verb,
                    ), withtime=False)
            ne = len(wrongs)
            if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
        else:
            inf('OK: The used {} enrichment sheets have legal values'.format(check[1]))

        nerrors = 0
        if len(repeated[check[0]]): # duplicates in sheets, check consistency
            repeats = repeated[check[0]]
            for x in sorted(repeats):
                overview = collections.defaultdict(list)
                for y in repeats[x]: overview[y[1]].append(y[0])
                px = T.words(L.d('word', x), fmt='ev')
                cx = T.words(L.d('word', L.u('clause', x)), fmt='ev')
                passage = T.passage(x)
                if len(overview) > 1:
                    nerrors += 1
                    if nerrors <= ERR_LIMIT:
                        msg('ERROR: {} Conflict in {}: {} = {} in {}:'.format(
                            passage, check[1], x, px, cx
                        ), withtime=False)
                        for vals in overview:
                            msg('\t{:<40} in verb(s) {}'.format(
                                ', '.join(vals),
                                ', '.join(overview[vals]),
                        ), withtime=False)
                elif False: # for debugging purposes
                #else:
                    nerrors += 1
                    if nerrors < ERR_LIMIT:
                        inf('{} Agreement in {} {} = {} in {}: {}'.format(
                            passage, check[1], x, px, cx, ','.join(list(overview.values())[0]),
                        ), withtime=False)
            ne = nerrors
            if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
        if nerrors == 0:
            inf('OK: The used {} enrichment sheets are consistent'.format(check[1]))

    if len(non_phrase):
        msg('ERROR: Enrichments have been applied to non-phrase nodes:')
        for x in sorted(non_phrase)[0:ERR_LIMIT]:
            px = T.words(L.d('word', x), fmt='ev')
            msg('{}: {} Node {} is not a phrase but a {}'.format(
                non_phrase[x], T.passage(x), x, F.otype.v(x),
            ), withtime=False)
        ne = len(non_phrase)
        if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
    else:
        inf('OK: all enriched nodes where phrase nodes')

    if len(wrong_node):
        msg('ERROR: Node in filled sheet did not occur in blank sheet:')
        for x in sorted(wrong_node)[0:ERR_LIMIT]:
            px = T.words(L.d('word', x), fmt='ev')
            msg('{}: {} node {}'.format(
                wrong_node[x], T.passage(x), x,
            ), withtime=False)
        ne = len(wrong_node)
        if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
    else:
        inf('OK: all enriched nodes occurred in the blank sheet')

    if len(dev_results):
        inf('OK: there are {} manual correction/enrichment annotations'.format(len(dev_results)))
        for r in dev_results[0:ERR_LIMIT]:
            (x, *vals, f_good, f_corr, s_manual) = r
            px = T.words(L.d('word', x), fmt='ev')
            cx = T.words(L.d('word', L.u('clause', x)), fmt='ev')
            inf('{:<30} {:>7} => {:<3} {:<3} {}\n\t{}\n\t\t{}'.format(
                'COR' if f_corr else '',
                'MAN' if s_manual else'',
                T.passage(x), x, ','.join(vals), px, cx
            ), withtime=False)
        ne = len(dev_results)
        if ne > ERR_LIMIT: inf('... AND {} ANNOTATIONS MORE'.format(ne - ERR_LIMIT), withtime=False)
    else:
        msg('WARNING: there are no manual correction/enrichment annotations')
    return results

corr = ExtraData(API)
corr.deliver_annots(
    'complements', 
    {'title': 'Verb complement enrichments', 'date': '2016-06'},
    [
        (None, 'complements', read_enrich, tuple(
            ('JanetDyk', 'ft', fname) for fname in list(enrich_fields.keys())+['function', 'f_correction', 's_manual']
        ))
    ],
)

stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

22m 39s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/oBR_etcbc4b.csv
22m 39s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/oFH_etcbc4b.csv
22m 39s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/oLH_etcbc4b.csv
22m 39s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/BRa_etcbc4b.csv
22m 39s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/BWa_etcbc4b.csv
22m 39s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/CJT_etcbc4b.csv
22m 39s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/CWB_etcbc4b.csv
22m 39s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/FJM_etcbc4b.csv
22m 40s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/HLK_etcbc4b.csv
22m 40s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/JRD_etcbc4b.csv
22m 40s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/JYa_

22m 40s OK: The used blank enrichment sheets have legal values
22m 40s OK: The used blank enrichment sheets are consistent
22m 40s OK: The used filled enrichment sheets have legal values
22m 40s OK: The used filled enrichment sheets are consistent
22m 40s OK: all enriched nodes where phrase nodes
22m 40s OK: all enriched nodes occurred in the blank sheet
22m 40s OK: there are 678 manual correction/enrichment annotations
COR                                    => 2_Kings 22:6 726393 ,,,,,
	LEX@R@CIJm W:LAB.ONIJm W:LAG.OD:RIJm 
		W:JIT.:NW. >OTOW L:<OF;J HAM.:L@>K@H LEX@R@CIJm W:LAB.ONIJm W:LAG.OD:RIJm 
COR                                    => Genesis 1:27 605442 complement,NA,direct_object,,,
	Z@K@R W.N:Q;B@H 
		Z@K@R W.N:Q;B@H B.@R@> >OT@m00

COR                                    => Genesis 5:2 606420 complement,NA,direct_object,,,
	Z@K@R W.N:Q;B@H 
		Z@K@R W.N:Q;B@H B.:R@>@m 
COR                                    => Isaiah 4:5 728419 adjunct,NA,NA,,location,location
	<AL K.@L&M:KOWn

# 6. Annox complements
We load the new and modified features into the LAF-Fabric API; they are compiled in the process.

Note that we draw in the new annotations by specifying an *annox* called `complements` (the second argument of the `fabric.load` function).

This annox consists of the LAF annotations generated from the enrichment data in the previous step. Every enrichment is stored in a new feature, 
named as specified above in `enrich_fields`, 
with label `ft` and namespace `JanetDyk`.
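
As a rough illustration of this naming scheme, the sketch below builds the two spellings of such a feature name that occur in this notebook: the qualified form used in the load specification (`JanetDyk:ft.function`) and the attribute form used on `F` (`F.JanetDyk_ft_function`). The helper names `qualified_name` and `accessor_name` are hypothetical and not part of LAF-Fabric; only the two string shapes are taken from the notebook.

```python
NAMESPACE = 'JanetDyk'
LABEL = 'ft'

def qualified_name(feature, namespace=NAMESPACE, label=LABEL):
    """Spelling used in the feature list of the load call (hypothetical helper)."""
    return '{}:{}.{}'.format(namespace, label, feature)

def accessor_name(feature, namespace=NAMESPACE, label=LABEL):
    """Spelling used as attribute on F (hypothetical helper)."""
    return '{}_{}_{}'.format(namespace, label, feature)

print(qualified_name('function'))
print(accessor_name('function'))
```

The qualified form is only needed when a feature name also exists in the main data, as is the case for `function`.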

In [72]:
API=fabric.load(source+version, 'complements', 'flow_corr', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        oid otype
        sp vs lex
        chapter verse
        function JanetDyk:ft.function
        s_manual f_correction
    ''' + ' '.join(enrich_fields),
    '''
    '''),
    "prepare": prepare,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: etcbc4b: UP TO DATE
  0.00s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s BEGIN COMPILE a: complements
  0.03s DETAIL: load main: X. [node]  -> 
  1.51s DETAIL: load main: X. [e]  -> 
  3.71s DETAIL: load main: G.node_anchor_min
  3.81s DETAIL: load main: G.node_anchor_max
  3.90s DETAIL: load main: G.node_sort
  4.00s DETAIL: load main: G.node_sort_inv
  4.73s DETAIL: load main: G.edges_from
  4.85s DETAIL: load main: G.edges_to
  4.99s LOGFILE=/Users/dirk/laf/laf-fabric-data/etcbc4b/bin/A/complements/__log__compile__.txt
  4.99s PARSING ANNOTATION FILES
  5.12s INFO: parsing complements.xml
  8.49s INFO: END PARSING
         0 good   regions  and     0 faulty ones
         0 linked nodes    and     0 unlinked ones
         0 good   edges    and     0 faulty ones
     52300 good   annots   and     0 faulty ones
    470700 good   features and     0 faulty ones
     52300 distinct xml identifiers

  8

## Simple test
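Take the first 10 phrases and print the uncorrected function feature, the corrected function feature, and whether the phrase lies in a clause with a selected verb.
Note that the corrected function feature is only filled in for phrases in clauses that contain a selected verb; elsewhere its value is `None`.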

In [73]:
for i in list(F.otype.s('phrase'))[0:10]: 
    print('{} - {} - {}'.format(
        F.function.v(i), 
        F.JanetDyk_ft_function.v(i),
        L.u('clause', i) in clause_verb_selected,
    ))

Time - Time - True
Pred - Pred - True
Subj - Subj - True
Objc - Objc - True
Conj - None - False
Subj - None - False
Pred - None - False
PreC - None - False
Conj - None - False
Subj - None - False
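
When both features are used together, one plausible convention is to prefer the corrected value and fall back to the original where no corrected value was recorded. The helper `effective_function` below is a hypothetical sketch, not part of this notebook; it works on plain values such as those returned by `F.function.v(i)` and `F.JanetDyk_ft_function.v(i)`.

```python
def effective_function(original, corrected):
    """Return the corrected function label if there is one, else the original."""
    return corrected if corrected is not None else original

# (original, corrected) pairs copied from the output above
pairs = [('Time', 'Time'), ('Conj', None), ('PreC', None)]
for (original, corrected) in pairs:
    print(effective_function(original, corrected))
```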


## Results

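We write all corrections and enrichments to a single semicolon-separated CSV file for checking.
Each selected clause gets a row of its own, followed by one row for each of its phrases.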

In [74]:
f = open(all_results, 'w')
NALLFIELDS = 12
tpl = ('{};' * (NALLFIELDS - 1))+'{}\n'

inf('collecting phrases ...')
f.write(tpl.format(
    '-',
    '-',
    'passage',
    'verb(s) text',
    '-',
    '-',
    '-',
    '-',
    '-',
    '-',
    'clause text',
    'clause node',
))
f.write(tpl.format(
    'corrected',
    'enriched',
    'passage',
    '-',
    'valence',
    'predication',
    'grammatical',
    'original',
    'lexical',
    'semantic',
    'phrase text',
    'phrase node',
))
i = 0
j = 0
c = 0
CHUNK_SIZE = 10000
for cn in sorted(clause_verb_selected):
    c += 1
    vrbs = sorted(clause_verb_selected[cn])
    f.write(tpl.format(
        '',
        '',
        T.passage(cn),
        ' '.join(F.lex.v(verb) for verb in vrbs),
        '',
        '',
        '',
        '',
        '',
        '',
        T.words(L.d('word', cn), fmt='ec').replace('\n', ' '),
        cn,
    ))
    for pn in L.d('phrase', cn):
        i += 1
        j += 1
        if j == CHUNK_SIZE:
            j = 0
            inf('{:>6} phrases in {:>5} clauses ...'.format(i, c))
        f.write(tpl.format(
            'COR' if F.f_correction.v(pn) == 'True' else '',
            'MAN' if F.s_manual.v(pn) == 'True' else '',
            T.passage(pn),
            '',
            F.valence.v(pn),
            F.predication.v(pn),
            F.grammatical.v(pn),
            F.original.v(pn),
            F.lexical.v(pn),
            F.semantic.v(pn),
            T.words(L.d('word', pn), fmt='ec').replace('\n', ' '),
            pn,
        ))
f.close()
inf('{:>6} phrases in {:>5} clauses done'.format(i, c))

    11s collecting phrases ...
    12s  10000 phrases in  2910 clauses ...
    12s  20000 phrases in  5916 clauses ...
    13s  30000 phrases in  9055 clauses ...
    14s  40000 phrases in 12168 clauses ...
    15s  50000 phrases in 15374 clauses ...
    15s  52300 phrases in 16053 clauses done
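
For checking, the resulting file could be read back with Python's `csv` module, using `;` as delimiter and skipping the two header rows. This is a minimal sketch, assuming that no field contains a `;` itself (the writing code above does not quote fields). The function name `read_results` is hypothetical and the two sample data rows are invented for illustration; they are not taken from the real file.

```python
import csv
import io

def read_results(handle, n_header_rows=2):
    """Yield the data rows of the results file as lists of strings."""
    reader = csv.reader(handle, delimiter=';', quoting=csv.QUOTE_NONE)
    for (i, row) in enumerate(reader):
        if i >= n_header_rows:
            yield row

# Two header rows as written above, then one invented clause row and one invented phrase row
sample = io.StringIO(
    '-;-;passage;verb(s) text;-;-;-;-;-;-;clause text;clause node\n'
    'corrected;enriched;passage;-;valence;predication;grammatical;original;lexical;semantic;phrase text;phrase node\n'
    ';;Book 1:1;VERB;;;;;;;clause words;1\n'
    'COR;;Book 1:1;;complement;NA;direct_object;Objc;;;phrase words;2\n'
)
rows = list(read_results(sample))
corrected = [row for row in rows if row[0] == 'COR']
print(len(rows), len(corrected))
```

With the real file, the `io.StringIO` object would be replaced by `open(all_results)`.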
