<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="right" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="right" src="images/etcbc4easy-small.png"/></a>

# Complement corrections


# 0. Introduction

Joint work of Dirk Roorda and Janet Dyk.

In order to do
[flowchart analysis](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/flowchart.html)
on verbs, we need to correct some coding errors.

Because the flowchart assigns meanings to verbs depending on the number and nature of complements found in their context, it is important that the phrases in those clauses are labeled correctly, i.e. that the
[function](https://shebanq.ancient-data.org/shebanq/static/docs/featuredoc/features/comments/function.html)
feature for those phrases has the correct value.

# 1. Task
In this notebook we do the following tasks:

* generate correction sheets for selected verbs,
* transform the set of filled-in correction sheets into an annotation package.

Between the first and second task, the sheets will have been filled in by Janet with corrections.

The resulting annotation package offers the corrections as the value of a new feature, also called `function`, but now in the annotation space `JanetDyk` instead of `etcbc4`.

# 2. Implementation

Start the engines, and note the import of the `ExtraData` functionality from the `etcbc.extra` module.
This module can turn data with anchors into additional LAF annotations to the big ETCBC LAF resource.

In [1]:
import sys, os, collections
from copy import deepcopy

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.extra import ExtraData

fabric = LafFabric()

  0.00s This is LAF-Fabric 4.6.2
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [2]:
source = 'etcbc'
version = '4b'

We instruct the API to load data.
Note that we ask for the XML identifiers, because `ExtraData` needs them to stitch the corrections into the LAF XML.

In [3]:
API = fabric.load(source+version, '--', 'flow_corr', {
    "xmlids": {"node": True, "edge": False},
    "features": ('''
        oid otype
        sp vs lex uvf
        function
        chapter verse
    ''',''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  5.86s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/flow_corr/__log__flow_corr.txt
  5.86s INFO: LOADING PREPARED data: please wait ... 
  5.86s prep prep: G.node_sort
  5.97s prep prep: G.node_sort_inv
  6.50s prep prep: L.node_up
  9.79s prep prep: L.node_down
    15s prep prep: V.verses
    15s prep prep: V.books_la
    15s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
    18s INFO: LOADED PREPARED data
    18s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK flow_corr AT 2016-05-27T11-31-57


## 2.1 Locations

In [4]:
ln_base = 'https://shebanq.ancient-data.org/hebrew/text'
ln_tpl = '?book={}&chapter={}&verse={}'
ln_tweak = '&version=4b&mr=m&qw=n&tp=txt_tb1&tr=hb&wget=x&qget=v&nget=x'

home_dir = os.path.expanduser('~').replace('\\', '/')
base_dir = '{}/Dropbox/SYNVAR'.format(home_dir)
kinds = ('corr_blank', 'corr_filled', 'enrich_blank', 'enrich_filled')
kdir = {}
for k in kinds:
    kd = '{}/{}'.format(base_dir, k)
    kdir[k] = kd
    if not os.path.exists(kd):
        os.makedirs(kd)

def vfile(verb, kind):
    if kind not in kinds:
        msg('Unknown kind `{}`'.format(kind))
        return None
    return '{}/{}_{}{}.csv'.format(kdir[kind], verb.replace('>','a').replace('<', 'o'), source, version)

## 2.2 Domain
Here is the set of verbs that interest us.

In [5]:
verbs_initial = set('''
    CJT
    BR>
    QR>
'''.strip().split())

motion_verbs = set('''
    <BR
    <LH
    BW>
    CWB
    HLK
    JRD
    JY>
    NPL
    NWS
    SWR
'''.strip().split())

double_object_verbs = set('''
    NTN
    <FH
    FJM
'''.strip().split())

complex_qal_verbs = set('''
    NF>
    PQD
'''.strip().split())

verbs = verbs_initial | motion_verbs | double_object_verbs | complex_qal_verbs

## 2.3 Phrase function

We need to correct some values of the phrase function.
When we receive the corrections, we check whether they have legal values.
Here we look up the possible values.

In [6]:
legal_values = dict(
    function={F.function.v(p) for p in F.otype.s('phrase')},
)

We need some extra values to indicate other kinds of coding errors.
Once these are in place, we generate a list of occurrences of the selected verbs, organized by the lexeme of the verb.

In [7]:
error_values = dict(
    function=dict(
        BoundErr='this phrase is part of another phrase and does not merit its own function value',
    ),
)

We add the keys of `error_values` to the legal values.

In [8]:
for feature in set(legal_values.keys()) | set(error_values.keys()):
    ev = error_values.get(feature, {})
    if ev:
        lv = legal_values.setdefault(feature, set())
        lv |= set(ev.keys())
inf('{}'.format(legal_values))

  1.33s {'function': {'Subj', 'PrAd', 'NCoS', 'Conj', 'Exst', 'EPPr', 'PtcO', 'Modi', 'NCop', 'Supp', 'Objc', 'BoundErr', 'ModS', 'PreS', 'Loca', 'Frnt', 'Intj', 'Ques', 'Nega', 'Voct', 'Cmpl', 'PreO', 'PrcS', 'PreC', 'ExsS', 'Adju', 'Pred', 'Time', 'IntS', 'Rela'}}


In [9]:
inf('Finding occurrences ...')
occs = collections.defaultdict(list)           # dictionary of all verb occurrence nodes per verb lexeme
clause_verb = collections.defaultdict(list)    # dictionary of all verb occurrence nodes per clause node
clause_verb_selected = collections.defaultdict(list) # idem but for the occurrences of selected verbs

nw = 0
nws = 0
for n in F.otype.s('word'):
    if F.sp.v(n) != 'verb': continue
    nw += 1
    lex = F.lex.v(n).rstrip('/=[')
    occs[lex].append(n)
    cn = L.u('clause', n)
    clause_verb[cn].append(n)
    if lex in verbs:
        nws += 1
        clause_verb_selected[cn].append(n)

inf('Done')
inf('Total:    {:>6} verb occurrences in {} clauses'.format(nw, len(clause_verb)), withtime=False)
inf('Selected: {:>6} verb occurrences in {} clauses'.format(nws, len(clause_verb_selected)), withtime=False)

for verb in sorted(verbs):
    inf('{} {:<5} occurrences'.format(verb, len(occs[verb])), withtime=False)    

  1.36s Finding occurrences ...
  3.03s Done
Total:     73679 verb occurrences in 70131 clauses
Selected:  16209 verb occurrences in 16053 clauses
<BR 556   occurrences
<FH 2629  occurrences
<LH 890   occurrences
BR> 54    occurrences
BW> 2570  occurrences
CJT 85    occurrences
CWB 1056  occurrences
FJM 609   occurrences
HLK 1554  occurrences
JRD 377   occurrences
JY> 1069  occurrences
NF> 656   occurrences
NPL 445   occurrences
NTN 2017  occurrences
NWS 159   occurrences
PQD 303   occurrences
QR> 883   occurrences
SWR 297   occurrences


# 3. Blank sheet generation
We generate the correction sheets as CSV files, with `;` as field separator.
Every row corresponds to a clause that contains an occurrence of the verb; a clause is listed only once, under its first occurrence of that verb.
Each row starts with the node number of the clause, the node number of the verb occurrence, a passage label (book, chapter, verse), a hyperlink to the passage in SHEBANQ, the text of the verb occurrence (in ETCBC transliteration, consonantal), and the verb stem. Then follow 4 columns for each phrase in the clause:

* phrase node number
* phrase text (ETCBC translit consonantal)
* original value of the `function` feature
* corrected value of the `function` feature (generated as empty)
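
The layout described above can be sketched without LAF-Fabric as follows. This is only an illustration: the node numbers, texts and labels are made up, and `build_header` and `build_row` are hypothetical helper names that do not occur in this notebook.

```python
FIXED_FIELDS = ['clause#', 'word#', 'passage', 'link', 'verb', 'stem']
PHRASE_FIELDS = ['phr{i}#', 'phr{i}_txt', 'phr{i}_function', 'phr{i}_corr']


def build_header(max_phrases):
    """Fixed columns followed by four columns per phrase slot."""
    header = list(FIXED_FIELDS)
    for i in range(1, max_phrases + 1):
        header.extend(f.format(i=i) for f in PHRASE_FIELDS)
    return header


def build_row(fixed, phrases):
    """fixed: values for the fixed columns; phrases: (node, text, function) triples."""
    row = list(fixed)
    for (node, text, function) in phrases:
        # the fourth column stays empty: it is filled in by the corrector
        row.extend((node, text, function, ''))
    return row


# Hypothetical data: one clause with two phrases.
fixed = [1001, 42, 'Genesis 1:1', 'link', 'BR>', 'qal']
phrases = [(2001, 'BR>', 'Pred'), (2002, '>LHJM', 'Subj')]

header = build_header(max_phrases=2)
row = build_row(fixed, phrases)
print(';'.join(header))
print(';'.join(str(x) for x in row))
```

Because clauses differ in their number of phrases, the header is sized to the clause with the most phrases, and shorter rows simply end earlier.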

In [27]:
phrases_seen = collections.Counter()

def gen_sheet(verb):
    rows = []
    fieldsep = ';'
    field_names = '''
        clause#
        word#
        passage
        link
        verb
        stem
    '''.strip().split()
    max_phrases = 0
    clauses_seen = set()
    for wn in occs[verb]:
        cln = L.u('clause', wn)
        if cln in clauses_seen: continue
        clauses_seen.add(cln)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        ch = F.chapter.v(vn)
        vs = F.verse.v(vn)
        passage_label = T.passage(vn)
        ln = ln_base+(ln_tpl.format(T.book_name(bn, lang='la'), ch, vs))+ln_tweak
        lnx = '''"=HYPERLINK(""{}""; ""link"")"'''.format(ln)
        vt = T.words([wn], fmt='ec').replace('\n', '')
        vstem = F.vs.v(wn)
        row = [cln, wn, passage_label, lnx, vt, vstem]
        phrases = L.d('phrase', cln)
        n_phrases = len(phrases)
        if n_phrases > max_phrases: max_phrases = n_phrases
        for pn in phrases:
            phrases_seen[pn] += 1
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            pf = F.function.v(pn)
            row.extend((pn, pt, pf, ''))
        rows.append(row)
    for i in range(max_phrases):
        field_names.extend('''
            phr{i}#
            phr{i}_txt
            phr{i}_function
            phr{i}_corr
        '''.format(i=i+1).strip().split())
    filename = vfile(verb, 'corr_blank')
    with open(filename, 'w') as row_file:
        row_file.write('{}\n'.format(fieldsep.join(field_names)))
        for row in rows:
            row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    inf('Generated correction sheet for verb {}'.format(filename))
    
for verb in verbs: gen_sheet(verb)
    
stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

23m 15s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/CJT_etcbc4b.csv
23m 15s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/QRa_etcbc4b.csv
23m 15s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/FJM_etcbc4b.csv
23m 15s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/NTN_etcbc4b.csv
23m 16s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/oBR_etcbc4b.csv
23m 16s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/BWa_etcbc4b.csv
23m 16s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/NPL_etcbc4b.csv
23m 16s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/NWS_etcbc4b.csv
23m 16s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/PQD_etcbc4b.csv
23m 16s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/NFa_etcbc4b.csv
23m 16s Generated co

# 4. Processing corrections
We read the filled-in correction sheets and extract the correction data from them.
We store the corrections in a dictionary keyed by the phrase node.
We check whether we get multiple corrections for the same phrase.
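
A minimal sketch of this reading step, independent of LAF-Fabric, might look as follows. The sheet content, the subset of legal values and the function name `read_corrections` are assumptions made for the illustration; the real sheets are read from disk and checked against the full set of legal values.

```python
import collections
import io

# Assumed subset of the legal values of the function feature.
LEGAL = {'Cmpl', 'Adju', 'Loca', 'Objc', 'Subj', 'Pred'}

# Hypothetical filled-in sheet: 6 fixed columns, then 4 columns per phrase.
SHEET = '''clause#;word#;passage;link;verb;stem;phr1#;phr1_txt;phr1_function;phr1_corr;phr2#;phr2_txt;phr2_function;phr2_corr
1001;42;Genesis 1:1;link;BR>;qal;2001;BR>;Pred;;2002;>LHJM;Objc;Subj
1002;57;Genesis 1:2;link;BR>;qal;2002;>LHJM;Objc;Cmpl;2003;B->RY;Adju;Xyz
'''


def read_corrections(handle, n_fixed=6, sep=';'):
    corrections = {}                           # phrase node -> corrected function
    repeated = collections.defaultdict(list)   # phrase node -> later, conflicting corrections
    illegal = set()                            # corrected values that are not legal
    next(handle)                               # skip the header line
    for line in handle:
        fields = line.rstrip('\n').split(sep)
        # walk over the phrase slots: node, text, function, correction
        for start in range(n_fixed, len(fields) - 3, 4):
            node, corr = fields[start], fields[start + 3].strip()
            if node == '' or corr == '':
                continue
            node = int(node)
            if node in corrections:
                repeated[node].append(corr)
            elif corr not in LEGAL:
                illegal.add(corr)
            else:
                corrections[node] = corr
    return corrections, dict(repeated), illegal


corrections, repeated, illegal = read_corrections(io.StringIO(SHEET))
print(corrections, repeated, illegal)
```

In this sketch the first correction for a phrase wins, and later ones are only reported, which mirrors the order of the checks in the cell below.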

In [26]:
phrases_seen = collections.Counter()
pf_corr = {}

def read_corr():
    function_values = legal_values['function']

    for verb in sorted(verbs):
        repeated = collections.defaultdict(list)
        non_phrase = set()
        illegal_fvalue = set()

        filename = vfile(verb, 'corr_filled')
        if not os.path.exists(filename):
            msg('NO file {}'.format(filename))
            continue
        else:
            inf('Processing {}'.format(filename))
        with open(filename) as f:
            header = next(f)  # skip the header line
            for line in f:
                fields = line.rstrip().split(';')
                for i in range(1, len(fields)//4):
                    (pn, pc) = (fields[2+4*i], fields[2+4*i+3])
                    if pn != '':
                        pc = pc.strip()
                        pn = int(pn)
                        phrases_seen[pn] += 1
                        if pc != '':
                            # accept a correction only if it passes all checks
                            if pn in pf_corr:
                                repeated[pn].append(pc)
                            elif pc not in function_values:
                                illegal_fvalue.add(pc)
                            elif F.otype.v(pn) != 'phrase':
                                non_phrase.add(pn)
                            else:
                                pf_corr[pn] = pc

        inf('{}: Found {:>5} corrections in {}'.format(verb, len(pf_corr), filename))
        if len(repeated):
            msg('ERROR: Some phrases have been corrected multiple times!')
            for x in sorted(repeated):
                msg('{:>6}: {}'.format(x, ', '.join(repeated[x])))
        else:
            inf('OK: Corrected phrases did not receive multiple corrections')
        if len(non_phrase):
            msg('ERROR: Corrections have been applied to non-phrase nodes: {}'.format(','.join(str(x) for x in sorted(non_phrase))))
        else:
            inf('OK: all corrected nodes where phrase nodes')
        if len(illegal_fvalue):
            msg('ERROR: Some corrections supply illegal values for phrase function!')
            msg('`{}`'.format('`, `'.join(illegal_fvalue)))
        else:
            inf('OK: all corrected values are legal')
    inf('Found {} corrections in the phrase function'.format(len(pf_corr)))
        
read_corr()

stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

15m 46s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oBR_etcbc4b.csv
15m 46s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oFH_etcbc4b.csv
15m 46s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oLH_etcbc4b.csv


15m 46s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/BRa_etcbc4b.csv
15m 46s BR>: Found     2 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/BRa_etcbc4b.csv
15m 46s OK: Corrected phrases did not receive multiple corrections
15m 46s OK: all corrected nodes where phrase nodes
15m 46s OK: all corrected values are legal
15m 46s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/BWa_etcbc4b.csv
15m 46s BW>: Found    57 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/BWa_etcbc4b.csv
15m 46s OK: Corrected phrases did not receive multiple corrections
15m 46s OK: all corrected nodes where phrase nodes
15m 46s OK: all corrected values are legal
15m 46s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/CJT_etcbc4b.csv
15m 46s CJT: Found    60 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/CJT_etcbc4b.csv
15m 46s OK: Corrected phrases did not receive multiple corrections
15m 46s OK: all corrected nodes where phrase nodes
15m 46s OK: all corrected values are legal
15m 46s Pr

15m 46s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/FJM_etcbc4b.csv


15m 46s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/HLK_etcbc4b.csv
15m 46s HLK: Found   269 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/HLK_etcbc4b.csv
15m 46s OK: Corrected phrases did not receive multiple corrections
15m 46s OK: all corrected nodes where phrase nodes
15m 46s OK: all corrected values are legal


15m 46s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/JRD_etcbc4b.csv
15m 46s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/JYa_etcbc4b.csv
15m 46s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/NFa_etcbc4b.csv
15m 46s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/NPL_etcbc4b.csv


15m 46s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/NTN_etcbc4b.csv
15m 47s NTN: Found   336 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/NTN_etcbc4b.csv
15m 47s OK: Corrected phrases did not receive multiple corrections
15m 47s OK: all corrected nodes where phrase nodes
15m 47s OK: all corrected values are legal


15m 47s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/NWS_etcbc4b.csv
15m 47s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/PQD_etcbc4b.csv


15m 47s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/QRa_etcbc4b.csv
15m 47s QR>: Found   339 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/QRa_etcbc4b.csv
15m 47s OK: Corrected phrases did not receive multiple corrections
15m 47s OK: all corrected nodes where phrase nodes
15m 47s OK: all corrected values are legal


15m 47s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/SWR_etcbc4b.csv


15m 47s Found 339 corrections in the phrase function
15m 47s 26090  phrases seen 1  time(s)
15m 47s 33     phrases seen 2  time(s)
15m 47s Total phrases seen: 26123


# 5. Enrichment

We create blank sheets for new feature assignments, based on the corrected data.
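
As a rough sketch of what "based on the corrected data" means: for every phrase we take the corrected function label if there is one, otherwise the original label, and look up default values for the new features. The three default rows below are taken from the defaults table further down in this section; the phrase nodes, their labels and the helper name `enrich_defaults` are hypothetical.

```python
FIELDS = ('valence', 'grammatical', 'lexical', 'semantic')

# Default values for three function labels (excerpt of the full table).
DEFAULTS = {
    'Adju': ('adjunct', 'NA', '', ''),
    'Cmpl': ('complement', '*', '', ''),
    'Loca': ('adjunct', 'NA', 'location', 'location'),
}

original = {2001: 'Cmpl', 2002: 'Adju'}   # hypothetical phrase nodes and labels
corrections = {2002: 'Loca'}              # hypothetical correction


def enrich_defaults(node):
    """Corrected label if present, else the original one, with its default fields."""
    function = corrections.get(node) or original[node]
    return function, dict(zip(FIELDS, DEFAULTS[function]))


for node in sorted(original):
    print(node, *enrich_defaults(node))
```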

In [28]:
enrich_field_spec = '''
valence
    adjunct
    complement
    core

grammatical
    *
    subject
    object
    copula
    copula+subject
    predication
    predication+subject
    predication+object

lexical
    location
    time

semantic
    benefactive
    time
    location
'''
enrich_fields = collections.OrderedDict()
cur_e = None
for line in enrich_field_spec.strip().split('\n'):
    if line.startswith(' '):
        enrich_fields.setdefault(cur_e, set()).add(line.strip())
    else:
        cur_e = line.strip()
if None in enrich_fields:
    msg('Invalid enrich field specification')
else:
    inf('Enrich field specification OK')
for ef in enrich_fields:
    print('{} = {{{}}}'.format(ef, ', '.join(sorted(enrich_fields[ef]))))

30m 00s Enrich field specification OK
valence = {adjunct, complement, core}
grammatical = {*, copula, copula+subject, object, predication, predication+object, predication+subject, subject}
lexical = {location, time}
semantic = {benefactive, location, time}


In [29]:
specs = '''
Adju	Adjunct	adjunct	NA		
Cmpl	Complement	complement	*		
Conj	Conjunction	NA	NA	NA	NA
EPPr	Enclitic personal pronoun	NA	copula		
ExsS	Existence with subject suffix	core	copula+subject		
Exst	Existence	core	copula		
Frnt	Fronted element	NA	NA	NA	NA
Intj	Interjection	NA	NA	NA	NA
IntS	Interjection with subject suffix	core	subject		
Loca	Locative	adjunct	NA	location	location
Modi	Modifier	NA	NA	NA	NA
ModS	Modifier with subject suffix	core	subject		
NCop	Negative copula	core	copula		
NCoS	Negative copula with subject suffix	core	copula+subject		
Nega	Negation	NA	NA	NA	NA
Objc	Object	complement	object		
PrAd	Predicative adjunct	adjunct	NA		
PrcS	Predicate complement with subject suffix	core	predication+subject		
PreC	Predicate complement	core	predication		
Pred	Predicate	core	predication		
PreO	Predicate with object suffix	core	predication+object		
PreS	Predicate with subject suffix	core	predication+subject		
PtcO	Participle with object suffix	core	predication+object		
Ques	Question	NA	NA	NA	NA
Rela	Relative	NA	NA	NA	NA
Subj	Subject	core	subject		
Supp	Supplementary constituent	adjunct	NA		benefactive
Time	Time reference	adjunct	NA	time	time
Unkn	Unknown	NA	NA	NA	NA
Voct	Vocative	NA	NA	NA	NA'''.strip().split('\n')

In [30]:
transform = {}
for line in specs:
    x = line.split('\t') 
    transform[x[0]] = dict(zip(enrich_fields, x[2:]))
for e in error_values['function']:
    transform[e] = dict(zip(enrich_fields, ['NA']*4))

errors = 0
good = 0
for f in transform:
    for e in enrich_fields:
        val = transform[f][e]
        if val != '' and val != 'NA' and val not in enrich_fields[e]:
            msg('Defaults for `{}`: wrong `{}` value: "{}"'.format(f, e, val))
            errors += 1
        else: good += 1
if errors:
    msg('There were {} errors ({} good)'.format(errors, good))
else:
    inf('Defaults OK ({} good)'.format(good))

30m 02s Defaults OK (124 good)


In [31]:
ltpl = '{:<8}: {:<15} {:<20} {:<15} {:<15}'
print(ltpl.format('func', *enrich_fields))
for f in sorted(transform):
    sfs = transform[f]
    print(ltpl.format(f, *[sfs[sf] for sf in enrich_fields]))

func    : valence         grammatical          lexical         semantic       
Adju    : adjunct         NA                                                  
BoundErr: NA              NA                   NA              NA             
Cmpl    : complement      *                                                   
Conj    : NA              NA                   NA              NA             
EPPr    : NA              copula                                              
ExsS    : core            copula+subject                                      
Exst    : core            copula                                              
Frnt    : NA              NA                   NA              NA             
IntS    : core            subject                                             
Intj    : NA              NA                   NA              NA             
Loca    : adjunct         NA                   location        location       
ModS    : core            subject                   

## 5.1 Enrichment logic

For certain verbs and certain conditions, we can automatically fill in some of the new features.
For example, if the verb is `CJT`, and an adjunct phrase starts with the preposition `L` followed by a personal lexeme, we know that the semantic role is *benefactive*.

In [32]:
locative_lexemes = set('''
>RY/ >YL/
<BR/ <BRH/ <BWR/ <C==/ <JR/ <L=/ <LJ=/ <LJH/ <LJL/ <MD=/ <MDH/ <MH/ <MQ/ <MQ===/ <QB/
BJT/
CM CMJM/ CMC/ C<R/
DRK/
FDH/
HR/
JM/ JRDN/ JRWCLM/ JFR>L/
MDBR/ MW<D/ MWL/ MZBX/ MYRJM/ MQWM/ MR>CWT/ MSB/ MSBH/ MVH==/
QDM/
SBJB/
TJMN/ TXT/ TXWT/
YPWN/
'''.strip().split())

personal_lexemes = set('''
>B/ >CH/ >DM/ >DRGZR/ >DWN/ >JC/ >J=/ >KR/ >LJL/ >LMN=/ >LMNH/ >LMNJ/ >LWH/ >LWP/ >M/ 
>MH/ >MN==/ >MWN=/ >NC/ >NWC/ >PH/ >PRX/ >SJR/ >SJR=/ >SP/ >X/ >XCDRPN/
>XWH/ >XWT/
<BDH=/ <CWQ/ <D=/ <DH=/ <LMH/ <LWMJM/ <M/ <MD/ <MJT/ <QR=/ <R/ <WJL/ <WL/ <WL==/ <WLL/
<WLL=/ <YRH/
B<L/ B<LH/ BKJRH/ BKR/ BN/ BR/ BR===/ BT/ BTWLH/ BWQR/ BXRJM/ BXWN/ BXWR/
CD==/ CDH/ CGL/ CKN/ CLCJM/ CLJC=/ CMRH=/ CPXH/ CW<R/ CWRR/
DJG/ DWD/ DWDH/ DWG/ DWR/
F<JR=/ FB/ FHD/ FR/ FRH/ FRJD/ FVN/
GBJRH/ GBR/ GBR=/ GBRT/ GLB/ GNB/ GR/ GW==/ GWJ/ GZBR/
HDBR/ 
J<RH/ JBM/ JBMH/ JD<NJ/ JDDWT/ JLD/ JLDH/ JLJD/ JRJB/ JSWR/ JTWM/ JWYR/
JYRJM/ 
KCP=/ KHN/ KLH/ KMR/ KN<NJ=/ KNT/ KRM=/ KRWB/ KRWZ/
L>M/ LHQH/ LMD/ LXNH/
M<RMJM/ M>WRH/ MCBR/ MCJX/ MCM<T/ MCMR/ MCPXH/ MCQLT/ MD<=/ MD<T/ MG/
MJNQT/ MKR=/ ML>K/ MLK/ MLKH/ MLKT/ MLX=/ MLYR/ MMZR/ MNZRJM/ MPLYT/
MPY=/ MQHL/ MQY<H/ MR</ MR>/ MSGR=/ MT/ MWRH/ MYBH=/
N<R/ N<R=/ N<RH/ N<RWT/ N<WRJM/ NBJ>/ NBJ>H/ NCJN/ NFJ>/ NGJD/ NJN/ NKD/ 
NKR/ NPC/ NPJLJM/ NQD/ NSJK/ NTJN/ 
PLGC/ PLJL/ PLJV/ PLJV=/ PQJD/ PR<H/ PRC/ PRJY/ PRJY=/ PRTMJM/ PRZWN/ 
PSJL/ PSL/ PVR/ PVRH/ PXH/ PXR/
QBYH/ QCRJM/ QCT=/ QHL/ QHLH/ QHLT/ QJM/ QYJN/
R<H=/ R<H==/ R<JH/ R<=/ R<WT/ R>H/ RB</ RB=/ RB==/ RBRBNJN/ RGMH/ RHB/ RKB=/
RKJL/ RMH/ RQX==/ 
SBL/ SPR=/ SRJS/ SRK/ SRNJM/ 
T<RWBWT/ TLMJD/ TLT=/ TPTJ/ TR<=/ TRCT>/ TRTN/ TWCB/ TWL<H/ TWLDWT/ TWTX/
VBX/ VBX=/ VBXH=/ VPSR/ VPXJM/
WLD/
XBL==/ XBL======/ XBR/ XBR=/ XBR==/ XBRH/ XBRT=/ XJ=/ XLC/ XM=/ XMWT/
XMWY=/ XNJK/ XR=/ XRC/ XRC====/ XRP=/ XRVM/ XTN/ XTP/ XZH=/
Y<JRH/ Y>Y>JM/ YJ/ YJD==/ YJR==/ YR=/ YRH=/ 
ZKWR/ ZMR=/ ZR</
'''.strip().split())

In [33]:
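def has_L(vl, pn):
    # the first word of the phrase is the preposition L
    words = L.d('word', pn)
    return len(words) > 0 and F.lex.v(words[0]) == 'L'

def is_lex_personal(vl, pn):
    # the second word of the phrase is a personal lexeme
    words = L.d('word', pn)
    return len(words) > 1 and F.lex.v(words[1]) in personal_lexemes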

def is_lex_local(vl, pn):
    words = L.d('word', pn)
    return len({F.lex.v(w) for w in words} & locative_lexemes) > 0

def has_H_locale(vl, pn):
    words = L.d('word', pn)
    return len({w for w in words if F.uvf.v(w) == 'H'}) > 0  

### 5.1.1 Verb specific rules

The verb-specific enrichment rules are stored in a dictionary, keyed by the verb lexeme.
For each verb, the value is a list of rules, and each rule is a tuple of items.

All items but the last are assignments `(field, value)` of an enrichment field.
The last item is a tuple of conditions that must all be fulfilled for the rule to apply.

A condition can take the shape of

* a function, taking a verb lexeme and a phrase node as arguments and returning a boolean value
* a string `feature:value`, where `feature` is an ETCBC feature for phrases; the condition is true iff that feature has that value for the phrase in question
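
The rule format can be tried out on toy data. In the sketch below a phrase is a plain dictionary instead of a node, so the function condition takes that dictionary as its only argument; `starts_with_L` and `apply_rule` are hypothetical names, and the example phrase is made up.

```python
def starts_with_L(phrase):
    """Function condition: the first lexeme of the phrase is the preposition L."""
    return len(phrase['lexemes']) > 0 and phrase['lexemes'][0] == 'L'


# A rule: one or more (field, value) assignments, then a tuple of conditions.
rule = (
    ('semantic', 'benefactive'),
    ('function:Adju', starts_with_L),
)


def apply_rule(rule, phrase, values):
    """Return a copy of values, updated only if every condition of the rule holds."""
    assignments, conditions = rule[:-1], rule[-1]
    for cond in conditions:
        if isinstance(cond, str):
            # feature condition of the form feature:value
            feature, value = cond.split(':')
            holds = phrase.get(feature) == value
        else:
            # function condition
            holds = cond(phrase)
        if not holds:
            return dict(values)
    result = dict(values)
    result.update(assignments)
    return result


phrase = {'function': 'Adju', 'lexemes': ['L', 'MLK/']}
print(apply_rule(rule, phrase, {'semantic': ''}))
```

Keeping the conditions as the last item lets a rule carry any number of assignments without an extra level of nesting.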

In [34]:
enrich_logic = {
    'CJT': [
        (
            ('semantic', 'benefactive'), 
            ('function:Adju', has_L, is_lex_personal),
        ),
        (
            ('lexical', 'location'),
            ('function:Cmpl', has_H_locale),
        ),
        (
            ('lexical', 'location'),
            ('semantic', 'location'),
            ('function:Cmpl', is_lex_local),
        ),
    ],    
}

In [35]:
rule_index = collections.defaultdict(lambda: [])

def rule_as_str(vl, i):
    (conditions, sfassignments) = rule_index[vl][i]
    result = '{}-{} '.format(vl, i+1)
    pref_len = len(result)
    result += 'if {}\n'.format(' AND '.join(
        '{:<10} = {:<8}'.format(
                *c.split(':')
            ) if type(c) is str else '{:<15}'.format(
                c.__name__
            ) for c in conditions
    ))
    for (i, sfa) in enumerate(sfassignments):
        result += '{}{:<10} => {:<15}\n'.format(' '* pref_len, *sfa)
    return result

def check_logic():
    errors = 0
    nrules = 0
    for vl in sorted(enrich_logic):
        for items in enrich_logic[vl]:
            rule_index[vl].append((items[-1], items[0:-1]))
        for (i, (conditions, sfassignments)) in enumerate(rule_index[vl]):
            inf(rule_as_str(vl, i))
            nrules += 1
            for (sf, sfval) in sfassignments:
                if sf not in enrich_fields:
                    msg('"{}" not a valid enrich field'.format(sf), withtime=False)
                    errors += 1
                elif sfval not in enrich_fields[sf]:
                    msg('`{}`: "{}" not a valid enrich field value'.format(sf, sfval), withtime=False)
                    errors += 1
            for c in conditions:
                if type(c) == str:
                    x = c.split(':')
                    if len(x) != 2:
                        msg('Wrong feature condition {}'.format(c))
                        errors += 1
                    else:
                        (feat, val) = x
                        if feat not in legal_values:
                            msg('Feature `{}` not in use'.format(feat))
                            errors += 1
                        elif val not in legal_values[feat]:
                            msg('Feature `{}`: not a valid value "{}"'.format(feat, val))
                            errors += 1
    if errors:
        msg('There were {} errors in {} rules'.format(errors, nrules))
    else:
        inf('All {} rules OK'.format(nrules))

check_logic()

30m 07s CJT-1 if function   = Adju     AND has_L           AND is_lex_personal
      semantic   => benefactive    

30m 07s CJT-2 if function   = Cmpl     AND has_H_locale   
      lexical    => location       

30m 07s CJT-3 if function   = Cmpl     AND is_lex_local   
      lexical    => location       
      semantic   => location       

30m 07s All 3 rules OK


In [36]:
applied_cases = {}

def apply_logic(vl, pn, init_values):
    values = deepcopy(init_values)
    verb_rules = enrich_logic.get(vl, [])
    for (i, items) in enumerate(verb_rules):
        conditions = items[-1]
        sfassignments = items[0:-1]

        ok = True
        for condition in conditions:
            if type(condition) is str:
                (feature, value) = condition.split(':')
                this_ok = F.item[feature].v(pn) == value
            else:
                this_ok = condition(vl, pn)
            if not this_ok:
                ok = False
                break
        if ok:
            for (sf, sfval) in sfassignments:
                values[sf] = sfval
            applied_cases.setdefault((vl, i), []).append(pn)
    return tuple(values[sf] for sf in enrich_fields)

In [37]:
COMMON_FIELDS = '''
    cnode#
    vnode#
    pnode#
    book
    chapter
    verse
    link
    verb_lexeme
    verb_stem
    verb_occurrence
'''.strip().split()

CLAUSE_FIELDS = '''
    clause_text    
'''.strip().split()

PHRASE_FIELDS = '''
    phrase_text
    function
'''.strip().split() + list(enrich_fields)

field_names = []
for f in COMMON_FIELDS: field_names.append(f)
for i in range(max((len(CLAUSE_FIELDS), len(PHRASE_FIELDS)))):
    pf = PHRASE_FIELDS[i] if i < len(PHRASE_FIELDS) else '--'
    field_names.append(pf)
    
fillrows = len(CLAUSE_FIELDS) - len(PHRASE_FIELDS)
cfillrows = 0 if fillrows >= 0 else -fillrows
pfillrows = fillrows if fillrows >= 0 else 0
print('\n'.join(field_names))    

cnode#
vnode#
pnode#
book
chapter
verse
link
verb_lexeme
verb_stem
verb_occurrence
phrase_text
function
valence
grammatical
lexical
semantic


In [38]:
phrases_seen = collections.Counter()

def gen_sheet_enrich(verb):
    rows = []
    fieldsep = ';'
    clauses_seen = set()
    for wn in occs[verb]:
        cn = L.u('clause', wn)
        if cn in clauses_seen: continue
        clauses_seen.add(cn)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        book = T.book_name(bn, lang='en')
        chapter = F.chapter.v(vn)
        verse = F.verse.v(vn)
        ln = ln_base+(ln_tpl.format(T.book_name(bn, lang='la'), chapter, verse))+ln_tweak
        lnx = '''"=HYPERLINK(""{}""; ""link"")"'''.format(ln)
        vl = F.lex.v(wn).rstrip('[=')
        vstem = F.vs.v(wn)
        vt = T.words([wn], fmt='ec').replace('\n', '')
        ct = T.words(L.d('word', cn), fmt='ec').replace('\n', '')
        
        common_fields = (cn, wn, -1, book, chapter, verse, lnx, vl, vstem, vt)
        clause_fields = (ct,)
        rows.append(common_fields + clause_fields + (('',)*cfillrows))
        for pn in L.d('phrase', cn):
            phrases_seen[pn] += 1
            common_fields = (cn, wn, pn, book, chapter, verse, lnx, vl, vstem, vt)  # same 10 columns as COMMON_FIELDS
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            pf = pf_corr.get(pn, None) or F.function.v(pn)
            phrase_fields = (pt, pf) + apply_logic(vl, pn, transform[pf])            
            rows.append(common_fields + phrase_fields + (('',)*pfillrows))
    filename = vfile(verb, 'enrich_blank')
    with open(filename, 'w') as row_file:
        row_file.write('{}\n'.format(fieldsep.join(field_names)))
        for row in rows:
            row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    inf('Generated enrichment sheet for verb {} ({} rows)'.format(filename, len(rows)))
    
for verb in verbs: gen_sheet_enrich(verb)

inf('Done')
print('{} rules applied'.format(len(applied_cases)))
totaln = 0
for rule_id in applied_cases:
    cases = applied_cases[rule_id]
    n = len(cases)
    totaln += n
    print('{}\n\t{:>4} phrases: {}\n'.format(rule_as_str(*rule_id), n, ', '.join(str(c) for c in cases[0:10])))
print('{} applications in total'.format(totaln))

stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

30m 11s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/CJT_etcbc4b.csv (375 rows)
30m 11s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/QRa_etcbc4b.csv (3662 rows)
30m 11s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/FJM_etcbc4b.csv (2857 rows)
30m 12s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/NTN_etcbc4b.csv (9659 rows)
30m 12s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/oBR_etcbc4b.csv (2309 rows)
30m 13s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/BWa_etcbc4b.csv (10817 rows)
30m 13s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/NPL_etcbc4b.csv (1915 rows)
30m 13s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/NWS_etcbc4b.csv (613 rows)
30m 13s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/PQD_etcbc4b.csv (1269

30m 15s Done
2 rules applied
CJT-1 if function   = Adju     AND has_L           AND is_lex_personal
      semantic   => benefactive    

	   5 phrases: 615130, 630648, 712015, 794512, 797440

CJT-3 if function   = Cmpl     AND is_lex_local   
      lexical    => location       
      semantic   => location       

	   7 phrases: 606396, 619338, 630956, 654145, 776266, 789542, 797377

12 applications in total
30m 15s 52110  phrases seen 1  time(s)
30m 15s 181    phrases seen 2  time(s)
30m 15s 9      phrases seen 3  time(s)
30m 15s Total phrases seen: 52300


In [39]:
def check_h(vl, show_results=False):
    hl = {}
    total = 0
    for w in F.otype.s('word'):
        if F.sp.v(w) != 'verb' or F.lex.v(w).rstrip('[=/') != vl: continue
        total += 1
        c = L.u('clause', w)
        ps = L.d('phrase', c)
        phs = {p for p in ps if len({w for w in L.d('word', p) if F.uvf.v(w) == 'H'}) > 0}
        for f in ('Cmpl', 'Adju', 'Loca'):
            phc = {p for p in ps if (pf_corr.get(p, None) or F.function.v(p)) == f}  # a corrected function overrides the original one
            if len(phc & phs): hl.setdefault(f, set()).add(w)
    for f in hl:
        print('Verb {}: {} occurrences. He locales in {} phrases: {}'.format(vl, total, f, len(hl[f])))
        if show_results: print('\t{}'.format(', '.join(str(x) for x in hl[f])))
check_h('BW>', show_results=True)        

Verb BW>: 2570 occurrences. He locales in Loca phrases: 14
	90243, 93571, 29637, 284965, 289859, 136745, 257293, 289871, 154354, 154964, 9525, 257016, 284989, 93598
Verb BW>: 2570 occurrences. He locales in Cmpl phrases: 157
	26118, 26127, 146447, 187920, 197138, 272406, 95257, 184350, 398368, 289826, 201253, 24616, 78897, 401459, 100410, 32829, 100413, 198208, 5698, 200258, 100938, 24653, 141902, 112207, 186960, 24658, 196690, 28764, 34400, 298594, 248931, 132198, 162918, 12402, 5747, 146044, 396927, 153216, 134792, 151176, 188042, 97419, 426120, 257165, 136338, 21656, 162970, 200349, 214687, 24740, 257192, 158378, 100527, 25777, 160434, 214707, 4789, 4793, 272569, 139963, 90812, 249020, 38595, 113861, 138448, 8920, 282841, 19166, 20703, 26850, 43235, 145127, 8424, 8937, 170729, 397032, 254703, 154354, 200948, 426230, 176376, 79609, 165626, 206075, 208636, 27391, 269569, 106246, 157447, 26380, 149785, 170782, 211232, 126758, 26414, 27438, 246062, 109363, 172340, 249140, 398134, 64828,

It would be handy to generate an informational spreadsheet that shows all these cases.
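A minimal sketch of such a spreadsheet writer, assuming the cases are available as a dictionary from phrase function to a set of verb word nodes (the shape of `hl` in `check_h`). The function name `write_he_locale_sheet` and the column names are hypothetical; in the notebook one would probably add columns such as the passage of each node.

```python
import csv
import io

def write_he_locale_sheet(out, cases, delimiter=';'):
    """Write one row per (function, verb node) pair to the file-like object `out`.

    `cases` maps a phrase function (e.g. 'Cmpl') to a set of verb word nodes.
    Returns the number of data rows written.
    """
    writer = csv.writer(out, delimiter=delimiter, lineterminator='\n')
    writer.writerow(('function', 'verb node'))
    n = 0
    for function in sorted(cases):
        for node in sorted(cases[function]):
            writer.writerow((function, node))
            n += 1
    return n

# Usage with a few of the node numbers reported above
buf = io.StringIO()
n_rows = write_he_locale_sheet(buf, {'Loca': {90243, 93571}, 'Cmpl': {26118}})
sheet_text = buf.getvalue()
```

Writing to a file-like object keeps the sketch independent of the notebook's own file helpers; passing an opened file instead of `io.StringIO` would produce the sheet on disk.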

## 5.1 Process the enrichments

We read the enrichments, perform some consistency checks, and produce an annotation package.
If the filled-in sheet for a verb does not exist, we take the blank sheet, with the default assignment of the new features.
If a phrase gets conflicting values because it occurs in the sheets of multiple verbs, the values in a filled-in sheet take precedence over the values in a blank sheet. If the conflicting values both occur in filled-in sheets, the conflict is reported.
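The two rules can be illustrated with a small, self-contained sketch. It is not the notebook's implementation: the helper names `merge_enrichments` and `find_conflicts` are hypothetical, and the node numbers 101 and 102 are made up. Only the conflicting phrase 632989 is taken from the output of this notebook.

```python
import collections

def merge_enrichments(blank, filled):
    """Merge per-phrase values; a filled value wins when it differs from the blank one.

    Both arguments map a phrase node to a tuple of enrichment values.
    Returns the merged mapping and the set of phrases that were manually changed.
    """
    merged = {}
    changed = set()
    for pn, blank_vals in blank.items():
        filled_vals = filled.get(pn)
        if filled_vals is not None and filled_vals != blank_vals:
            merged[pn] = filled_vals
            changed.add(pn)
        else:
            merged[pn] = blank_vals
    return merged, changed

def find_conflicts(rows):
    """Group (verb, phrase node, values) rows by phrase; keep phrases with more than one distinct value tuple."""
    by_phrase = collections.defaultdict(lambda: collections.defaultdict(list))
    for verb, pn, vals in rows:
        by_phrase[pn][vals].append(verb)
    return {pn: dict(groups) for pn, groups in by_phrase.items() if len(groups) > 1}

# Hypothetical blank and filled sheets for one verb
blank = {101: ('NA', 'NA', 'NA', 'NA'), 102: ('complement', 'object', '', '')}
filled = {101: ('NA', 'NA', 'NA', 'location')}
merged, changed = merge_enrichments(blank, filled)

# The same phrase enriched differently in the filled sheets of two verbs
conflicts = find_conflicts([
    ('NF>', 632989, ('NA', 'NA', 'NA', 'location')),
    ('PQD', 632989, ('NA', 'NA', 'NA', 'time')),
    ('NF>', 101, ('NA', 'NA', 'NA', 'location')),
])
```

Here `merged[101]` carries the filled value, `merged[102]` keeps the blank default, and `conflicts` contains only phrase 632989.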

In [43]:
phrases_seen = collections.Counter()

def read_enrich(rootdir): # rootdir will not be used, data is computed from sheets
    pf_enriched = {
        False: {}, # for enrichments found in blank sheets
        True: {}, # for enrichments found in filled sheets
    }
    repeated = {
        False: collections.defaultdict(list), # for blank sheets
        True: collections.defaultdict(list), # for filled sheets
    }
    wrong_value = {
        False: collections.defaultdict(list),
        True: collections.defaultdict(list),
    }

    non_phrase = collections.defaultdict(list)
    wrong_node = collections.defaultdict(list)

    results = []
    dev_results = [] # results that deviate from the filled sheet
    
    ERR_LIMIT = 10

    for verb in sorted(verbs):
        vresults = {
            False: {}, # for blank sheets
            True: {}, # for filled sheets
        }
        for check in (
            (False, 'blank'), 
            (True, 'filled'),
        ):
            is_filled = check[0]
            filename = vfile(verb, 'enrich_{}'.format(check[1]))
            if not os.path.exists(filename):
                msg('NO {} enrichments file {}'.format(check[1], filename))
                continue
            with open(filename) as fh:
                next(fh) # skip the header line
                for line in fh:
                    fields = line.rstrip().split(';')
                    pn = int(fields[2])
                    if pn < 0: continue
                    phrases_seen[pn] += 1
                    vvals = tuple(fields[-4:])
                    for (f, v) in zip(enrich_fields, vvals):
                        if v != '' and v != 'NA' and v not in enrich_fields[f]:
                            wrong_value[is_filled][pn].append((verb, f, v))
                    vresults[is_filled][pn] = vvals
                    if pn in pf_enriched[is_filled]:
                        if pn not in repeated[is_filled]:
                            repeated[is_filled][pn] = [pf_enriched[is_filled][pn]]
                        repeated[is_filled][pn].append((verb, vvals))
                    else:
                        pf_enriched[is_filled][pn] = (verb, vvals)
                    if F.otype.v(pn) != 'phrase': 
                        non_phrase[pn].append(verb)
            for pn in sorted(vresults[True]):          # check whether the phrase ids are not mangled
                if pn not in vresults[False]:
                    wrong_node[pn].append(verb)
            for pn in sorted(vresults[False]):      # now collect all results, give precedence to filled values
                f_corr = pn in pf_corr                               # manual correction in phrase function
                s_manual = pn in vresults[True] and vresults[False][pn] != vresults[True][pn] # real change
                these_results = vresults[True][pn] if s_manual else vresults[False][pn]
                if f_corr or s_manual:
                    dev_results.append((pn,)+these_results+(f_corr, s_manual))
                results.append((pn,)+these_results+(f_corr, s_manual))

    for check in (
        (False, 'blank'), 
        (True, 'filled'),
    ):
        if len(wrong_value[check[0]]): #illegal values in sheets
            wrongs = wrong_value[check[0]]
            for x in sorted(wrongs)[0:ERR_LIMIT]:
                px = T.words(L.d('word', x), fmt='ev')
                cx = T.words(L.d('word', L.u('clause', x)), fmt='ev')
                passage = T.passage(x)
                msg('ERROR: {} Illegal value(s) in {}: {} = {} in {}:'.format(
                    passage, check[1], x, px, cx
                ), withtime=False)
                for (verb, f, v) in wrongs[x]:
                    msg('\t"{}" is an illegal value for "{}" in verb {}'.format(
                        v, f, verb,
                    ), withtime=False)
            ne = len(wrongs)
            if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
        else:
            inf('OK: The used {} enrichment sheets have legal values'.format(check[1]))

        nerrors = 0
        if len(repeated[check[0]]): # duplicates in sheets, check consistency
            repeats = repeated[check[0]]
            for x in sorted(repeats):
                overview = collections.defaultdict(list)
                for y in repeats[x]: overview[y[1]].append(y[0])
                px = T.words(L.d('word', x), fmt='ev')
                cx = T.words(L.d('word', L.u('clause', x)), fmt='ev')
                passage = T.passage(x)
                if len(overview) > 1:
                    nerrors += 1
                    if nerrors <= ERR_LIMIT:
                        msg('ERROR: {} Conflict in {}: {} = {} in {}:'.format(
                            passage, check[1], x, px, cx
                        ), withtime=False)
                        for vals in overview:
                            msg('\t{:<40} in verb(s) {}'.format(
                                ', '.join(vals),
                                ', '.join(overview[vals]),
                        ), withtime=False)
                elif False: # for debugging purposes
                #else:
                    nerrors += 1
                    if nerrors < ERR_LIMIT:
                        inf('{} Agreement in {} {} = {} in {}: {}'.format(
                            passage, check[1], x, px, cx, ','.join(list(overview.values())[0]),
                        ), withtime=False)
            ne = nerrors
            if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
        if nerrors == 0:
            inf('OK: The used {} enrichment sheets are consistent'.format(check[1]))

    if len(non_phrase):
        msg('ERROR: Enrichments have been applied to non-phrase nodes:')
        for x in sorted(non_phrase)[0:ERR_LIMIT]:
            px = T.words(L.d('word', x), fmt='ev')
            msg('{}: {} Node {} is not a phrase but a {}'.format(
                non_phrase[x], T.passage(x), x, F.otype.v(x),
            ), withtime=False)
        ne = len(non_phrase)
        if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
    else:
        inf('OK: all enriched nodes were phrase nodes')

    if len(wrong_node):
        msg('ERROR: Node in filled sheet did not occur in blank sheet:')
        for x in sorted(wrong_node)[0:ERR_LIMIT]:
            px = T.words(L.d('word', x), fmt='ev')
            msg('{}: {} node {}'.format(
                wrong_node[x], T.passage(x), x,
            ), withtime=False)
        ne = len(wrong_node)
        if ne > ERR_LIMIT: msg('... AND {} CASES MORE'.format(ne - ERR_LIMIT), withtime=False)
    else:
        inf('OK: all enriched nodes occurred in the blank sheet')

    if len(dev_results):
        inf('OK: there are {} manual correction/enrichment annotations'.format(len(dev_results)))
        for r in dev_results[0:ERR_LIMIT]:
            (x, *vals, f_corr, s_manual) = r
            px = T.words(L.d('word', x), fmt='ev')
            cx = T.words(L.d('word', L.u('clause', x)), fmt='ev')
            inf('{:<30} {:>7} => {:<3} {:<3} {}\n\t{}\n\t\t{}'.format(
                'COR' if f_corr else '',
                'MAN' if s_manual else '',
                T.passage(x), x, ','.join(vals), px, cx
            ), withtime=False)
        ne = len(dev_results)
        if ne > ERR_LIMIT: inf('... AND {} ANNOTATIONS MORE'.format(ne - ERR_LIMIT), withtime=False)
    else:
        msg('WARNING: there are no manual correction/enrichment annotations')
    return results

corr = ExtraData(API)
corr.deliver_annots(
    'complements', 
    {'title': 'Verb complement enrichments', 'date': '2016-06'},
    [
        (None, 'complements', read_enrich, tuple(
            ('JanetDyk', 'ft', fname) for fname in list(enrich_fields.keys())+['f_correction', 's_manual']
        ))
    ],
)

stats = collections.Counter()
for (p, times) in phrases_seen.items(): stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    inf('{:<6} phrases seen {:<2} time(s)'.format(n, times))
inf('Total phrases seen: {}'.format(len(phrases_seen)))

21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/oBR_etcbc4b.csv
21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/oFH_etcbc4b.csv
21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/oLH_etcbc4b.csv
21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/BRa_etcbc4b.csv
21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/BWa_etcbc4b.csv
21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/CJT_etcbc4b.csv
21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/CWB_etcbc4b.csv
21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/FJM_etcbc4b.csv
21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/HLK_etcbc4b.csv
21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/JRD_etcbc4b.csv
21m 12s NO filled enrichments file /Users/dirk/Dropbox/SYNVAR/enrich_filled/JYa_

21m 12s OK: The used blank enrichment sheets have legal values
21m 12s OK: The used blank enrichment sheets are consistent
21m 12s OK: The used filled enrichment sheets have legal values


ERROR: Exodus 30:12 Conflict in filled: 632989 = K.IJ  in K.IJ TIF.@> >ET&RO>C B.:N;J&JIF:R@>;L LIP:QUD;JHEm :
	NA, NA, NA, location                     in verb(s) NF>
	NA, NA, NA, time                         in verb(s) PQD


21m 12s OK: all enriched nodes where phrase nodes
21m 12s OK: all enriched nodes occurred in the blank sheet
21m 12s OK: there are 344 manual correction/enrichment annotations
COR                                    => 2_Kings 22:6 726393 NA,NA,NA,NA
	LEX@R@CIJm W:LAB.ONIJm W:LAG.OD:RIJm 
		W:JIT.:NW. >OTOW L:<OF;J HAM.:L@>K@H LEX@R@CIJm W:LAB.ONIJm W:LAG.OD:RIJm 
COR                                    => Genesis 5:2 606420 complement,object,,
	Z@K@R W.N:Q;B@H 
		Z@K@R W.N:Q;B@H B.:R@>@m 
COR                                    => Psalms 89:48 799849 complement,object,,
	C.@W:> 
		<AL&MAH&C.@W:> B.@R@>T@ K@L&B.:N;J&>@D@m00

COR                                    => Genesis 45:25 621386 NA,NA,NA,NA
	>EREy K.:NA<An >EL&JA<:AQOB >:ABIJHEm00

		WAJ.@BO>W. >EREy K.:NA<An >EL&JA<:AQOB >:ABIJHEm00

COR                                    => Exodus 14:16 627778 complement,*,,
	B.:TOWk: HAJ.@m 
		W:J@BO>W. B:N;J&JIF:R@>;L B.:TOWk: HAJ.@m B.AJ.AB.@C@H00

COR                                    => Ex

# 6. Annox complements
We load the new annotations into the LAF-Fabric API, in the process of which they will be compiled.

Note that we draw in the new annotations by specifying an *annox* called `complements` (the second argument of the `fabric.load` function).

Then we turn that data into LAF annotations. Every enrichment is stored in a new feature,
whose name is specified above in `enrich_fields`,
with label `ft` and namespace `JanetDyk`.
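As a rough illustration of how these feature specifications are built, here is a sketch that mirrors the tuple construction passed to `deliver_annots` above. The list `enrich_field_names` is an assumption: it uses the four enrichment features read later in this notebook (`valence`, `grammatical`, `lexical`, `semantic`), and the real order of the keys of `enrich_fields` may differ.

```python
# Assumption: the keys of enrich_fields, in an assumed order
enrich_field_names = ['valence', 'grammatical', 'lexical', 'semantic']
# The two bookkeeping features added alongside the enrichments
extra_names = ['f_correction', 's_manual']

# One (namespace, label, feature name) triple per new feature
feature_specs = tuple(
    ('JanetDyk', 'ft', name) for name in enrich_field_names + extra_names
)

# Every annotated phrase carries all features, so the compile log should
# report (number of annotations) x (number of features) feature values.
n_annots = 52300
n_features = n_annots * len(feature_specs)
```

With six features per phrase, 52300 annotations give 313800 feature values, which agrees with the counts in the compile log of the load step.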

In [50]:
API=fabric.load(source+version, 'complements', 'flow_corr', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        oid otype
        sp vs lex
        function
        chapter verse
        function s_manual f_correction
    ''' + ' '.join(enrich_fields),
    '''
    '''),
    "prepare": prepare,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: UP TO DATE
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s BEGIN COMPILE a: complements
  0.00s DETAIL: load main: X. [node]  -> 
  1.66s DETAIL: load main: X. [e]  -> 
  3.89s DETAIL: load main: G.node_anchor_min
  3.98s DETAIL: load main: G.node_anchor_max
  4.08s DETAIL: load main: G.node_sort
  4.18s DETAIL: load main: G.node_sort_inv
  4.83s DETAIL: load main: G.edges_from
  4.96s DETAIL: load main: G.edges_to
  5.10s LOGFILE=/Users/dirk/laf/laf-fabric-data/etcbc4b/bin/A/complements/__log__compile__.txt
  5.10s PARSING ANNOTATION FILES
  5.74s INFO: parsing complements.xml
  8.13s INFO: END PARSING
         0 good   regions  and     0 faulty ones
         0 linked nodes    and     0 unlinked ones
         0 good   edges    and     0 faulty ones
     52300 good   annots   and     0 faulty ones
    313800 good   features and     0 faulty ones
     52300 distinct xml identifiers

  8.13s MODELING RES

In [56]:
f = outfile('all.csv')
NALLFIELDS = 10
tpl = ('{};' * (NALLFIELDS - 1))+'{}\n'

inf('collecting phrases ...')
f.write(tpl.format(
    '-',
    '-',
    'passage',
    'verb(s) text',
    '-',
    '-',
    '-',
    '-',
    'clause text',
    'clause node',
))
f.write(tpl.format(
    'corrected',
    'enriched',
    'passage',
    '-',
    'valence',
    'grammatical',
    'lexical',
    'semantic',
    'phrase text',
    'phrase node',
))
i = 0
j = 0
c = 0
CHUNK_SIZE = 10000
for cn in sorted(clause_verb_selected):
    c += 1
    verbs = sorted(clause_verb_selected[cn])
    f.write(tpl.format(
        '',
        '',
        T.passage(cn),
        ' '.join(F.lex.v(verb) for verb in verbs),
        '',
        '',
        '',
        '',
        T.words(L.d('word', cn), fmt='ec').replace('\n', ' '),
        cn,
    ))
    for pn in L.d('phrase', cn):
        i += 1
        j += 1
        if j == CHUNK_SIZE:
            j = 0
            inf('{:>6} phrases in {:>5} clauses ...'.format(i, c))
        f.write(tpl.format(
            'COR' if F.f_correction.v(pn) == 'True' else '',
            'MAN' if F.s_manual.v(pn) == 'True' else '',
            T.passage(pn),
            '',
            F.valence.v(pn),
            F.grammatical.v(pn),
            F.lexical.v(pn),
            F.semantic.v(pn),
            T.words(L.d('word', pn), fmt='ec').replace('\n', ' '),
            pn,
        ))
f.close()
inf('{:>6} phrases in {:>5} clauses done'.format(i, c))

20m 40s collecting phrases ...
20m 41s  10000 phrases in  2910 clauses ...
20m 41s  20000 phrases in  5916 clauses ...
20m 42s  30000 phrases in  9055 clauses ...
20m 43s  40000 phrases in 12168 clauses ...
20m 43s  50000 phrases in 15374 clauses ...
20m 44s  52300 phrases in 16053 clauses done
