<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="right" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="right"src="images/etcbc4easy-small.png"/></a>

# Complement corrections


# 0. Introduction

Joint work of Dirk Roorda and Janet Dyk.

In order to do
[flowchart analysis](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/flowchart.html)
on verbs, we need to correct some coding errors.

Because the flowchart assigns meanings to verbs depending on the number and nature of complements found in their context, it is important that the phrases in those clauses are labeled correctly, i.e. that the
[function](https://shebanq.ancient-data.org/shebanq/static/docs/featuredoc/features/comments/function.html)
feature for those phrases have the correct label.

# 1. Task
In this notebook we do the following tasks:

* generate correction sheets for selected verbs,
* transform the set of filled in correction sheets into an annotation package

Between the first and second task, the sheets will have been filled in by Janet with corrections.

The resulting annotation package offers the corrections as the value of a new feature, also called `function`, but now in the annotation space `JanetDyk` instead of `etcbc4`.

# 2. Implementation

Start the engines, and note the import of the `ExtraData` functionality from the `etcbc.extra` module.
This module can turn data with anchors into additional LAF annotations to the big ETCBC LAF resource.

In [1]:
import sys,os, collections
from copy import deepcopy

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.extra import ExtraData

fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.25
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [2]:
source = 'etcbc'
version = '4b'

We instruct the API to load data.
Note that we ask for the XML identifiers, because `ExtraData` needs them to stitch the corrections into the LAF XML.

In [3]:
API = fabric.load(source+version, '--', 'flow_corr', {
    "xmlids": {"node": True, "edge": False},
    "features": ('''
        oid otype
        sp vs lex uvf
        function
        chapter verse
    ''',''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  5.90s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/flow_corr/__log__flow_corr.txt
  5.90s INFO: LOADING PREPARED data: please wait ... 
  5.90s prep prep: G.node_sort
  6.00s prep prep: G.node_sort_inv
  6.52s prep prep: L.node_up
  9.82s prep prep: L.node_down
    16s prep prep: V.verses
    16s prep prep: V.books_la
    16s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
    18s INFO: LOADED PREPARED data
    18s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK flow_corr AT 2016-04-06T11-51-22


# 2.1 Locations

In [4]:
ln_base = 'https://shebanq.ancient-data.org/hebrew/text'
ln_tpl = '?book={}&chapter={}&verse={}'
ln_tweak = '&version=4b&mr=m&qw=n&tp=txt_tb1&tr=hb&wget=x&qget=v&nget=x'

annox_basedir = API['data_dir']
annox_subdir = 'cpl'
annox_dir = '{}/{}'.format(annox_basedir, annox_subdir)

home_dir = os.path.expanduser('~').replace('\\', '/')
base_dir = '{}/Dropbox/SYNVAR'.format(home_dir)
kinds = ('corr_blank', 'corr_filled', 'enrich_blank', 'enrich_filled')
kdir = {}
for k in kinds:
    kd = '{}/{}'.format(base_dir, k)
    kdir[k] = kd
    if not os.path.exists(kd):
        os.makedirs(kd)

def vfile(verb, kind):
    if kind not in kinds:
        msg('Unknown kind `{}`'.format(kind))
        return None
    return '{}/{}_{}{}.csv'.format(kdir[kind], verb.replace('>','a').replace('<', 'o'), source, version)

# 2.2 Domain
Here is the set of verbs that interest us.

In [70]:
verbs_initial = set('''
    CJT
    BR>
    QR>
'''.strip().split())

motion_verbs = set('''
    <BR
    <LH
    BW>
    CWB
    HLK
    JRD
    JY>
    NPL
    NWS
    SWR
'''.strip().split())

double_object_verbs = set('''
    NTN
    <FH
    FJM
'''.strip().split())

complex_qal_verbs = set('''
    NF>
    PQD
'''.strip().split())

verbs = verbs_initial | motion_verbs | double_object_verbs | complex_qal_verbs

# 2.3 Phrase function

We need to correct some values of the phrase function.
When we receive the corrections, we check whether they have legal values.
Here we look up the possible values.

In [71]:
legal_values = {}

In [72]:
legal_values['function'] = {F.function.v(p) for p in F.otype.s('phrase')}

We generate a list of occurrences of those verbs, organized by the lexeme of the verb.
We need some extra values, to indicate other coding errors.

In [73]:
error_values = dict(
    BoundErr='this phrase is part of another phrase and does not merit its own function value',
)

legal_values['function'] |= set(error_values)

In [74]:
msg('Finding occurrences')
occs = collections.defaultdict(list)
for n in F.otype.s('word'):
    if F.sp.v(n) != 'verb': continue
    lex = F.lex.v(n).rstrip('/=[')
    occs[lex].append(n)
msg('Done')
for verb in sorted(verbs):
    print('{} {:<5} occurrences'.format(verb, len(occs[verb])))

 1h 22m 54s Finding occurrences
 1h 22m 56s Done


<BR 556   occurrences
<FH 2629  occurrences
<LH 890   occurrences
BR> 54    occurrences
BW> 2570  occurrences
CJT 85    occurrences
CWB 1056  occurrences
FJM 609   occurrences
HLK 1554  occurrences
JRD 377   occurrences
JY> 1069  occurrences
NF> 656   occurrences
NPL 445   occurrences
NTN 2017  occurrences
NWS 159   occurrences
PQD 303   occurrences
QR> 883   occurrences
SWR 297   occurrences


# 3 Blank sheet generation
Generate correction sheets.
They are CSV files. Every row corresponds to a verb occurrence.
The fields per row are the node numbers of the clause in which the verb occurs, the node number of the verb occurrence, the text of the verb occurrence (in ETCBC transliteration, consonantal) a passage label (book, chapter, verse), and then 4 columns for each phrase in the clause:

* phrase node number
* phrase text (ETCBC translit consonantal)
* original value of the `function` feature
* corrected value of the `function` feature (generated as empty)

In [75]:
def gen_sheet(verb):
    rows = []
    fieldsep = ';'
    field_names = '''
        clause#
        word#
        passage
        link
        verb
        stem
    '''.strip().split()
    max_phrases = 0
    clauses_seen = set()
    for wn in occs[verb]:
        cln = L.u('clause', wn)
        if cln in clauses_seen: continue
        clauses_seen.add(cln)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        ch = F.chapter.v(vn)
        vs = F.verse.v(vn)
        passage_label = '{} {}:{}'.format(T.book_name(bn, lang='en'), ch, vs)
        ln = ln_base+(ln_tpl.format(T.book_name(bn, lang='la'), ch, vs))+ln_tweak
        lnx = '''"=HYPERLINK(""{}""; ""link"")"'''.format(ln)
        vt = T.words([wn], fmt='ec').replace('\n', '')
        vstem = F.vs.v(wn)
        row = [cln, wn, passage_label, lnx, vt, vstem]
        phrases = L.d('phrase', cln)
        n_phrases = len(phrases)
        if n_phrases > max_phrases: max_phrases = n_phrases
        for pn in phrases:
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            pf = F.function.v(pn)
            row.extend((pn, pt, pf, ''))
        rows.append(row)
    for i in range(max_phrases):
        field_names.extend('''
            phr{i}#
            phr{i}_txt
            phr{i}_function
            phr{i}_corr
        '''.format(i=i+1).strip().split())
    filename = vfile(verb, 'corr_blank')
    row_file = open(filename, 'w')
    row_file.write('{}\n'.format(fieldsep.join(field_names)))
    for row in rows:
        row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    row_file.close()
    msg('Generated correction sheet for verb {}'.format(filename))
    
for verb in verbs: gen_sheet(verb)    

 1h 23m 12s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/oFH_etcbc4b.csv
 1h 23m 12s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/HLK_etcbc4b.csv
 1h 23m 13s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/NPL_etcbc4b.csv
 1h 23m 13s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/CJT_etcbc4b.csv
 1h 23m 13s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/CWB_etcbc4b.csv
 1h 23m 13s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/JYa_etcbc4b.csv
 1h 23m 13s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/NFa_etcbc4b.csv
 1h 23m 13s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/FJM_etcbc4b.csv
 1h 23m 13s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_blank/PQD_etcbc4b.csv
 1h 23m 13s Generated correction sheet for verb /Users/dirk/Dropbox/SYNVAR/corr_bl

# 4 Processing corrections
We read the filled-in correction sheets and extract the correction data out of it.
We store the corrections in a dictionary keyed by the phrase node.
We check whether we get multiple corrections for the same phrase.

In [11]:
pf_corr = {}

def read_corr():
    function_values = legal_values['function']

    for verb in sorted(verbs):
        repeated = collections.defaultdict(list)
        non_phrase = set()
        illegal_fvalue = set()

        filename = vfile(verb, 'corr_filled')
        if not os.path.exists(filename):
            msg('NO file {}'.format(filename))
            continue
        else:
            inf('Processing {}'.format(filename))
        with open(filename) as f:
            header = f.__next__()
            for line in f:
                fields = line.rstrip().split(';')
                for i in range(1, len(fields)//4):
                    (pn, pc) = (fields[2+4*i], fields[2+4*i+3])
                    if pn != '':
                        pc = pc.strip()
                        pn = int(pn)
                        if pc != '':
                            good = True
                            for i in [1]:
                                good = False
                                if pn in pf_corr:
                                    repeated[pn] += pc
                                    continue
                                if pc not in function_values:
                                    illegal_fvalue.add(pc)
                                    continue
                                if F.otype.v(pn) != 'phrase': 
                                    non_phrase.add(pn)
                                    continue
                                good = True
                            if good:
                                pf_corr[pn] = pc

        inf('{}: Found {:>5} corrections in {}'.format(verb, len(pf_corr), filename))
        if len(repeated):
            msg('ERROR: Some phrases have been corrected multiple times!')
            for x in sorted(repeated):
                msg('{:>6}: {}'.format(x, ', '.join(repeated[x])))
        else:
            inf('OK: Corrected phrases did not receive multiple corrections')
        if len(non_phrase):
            msg('ERROR: Corrections have been applied to non-phrase nodes: {}'.format(','.join(non_phrase)))
        else:
            inf('OK: all corrected nodes where phrase nodes')
        if len(illegal_fvalue):
            msg('ERROR: Some corrections supply illegal values for phrase function!')
            msg('`{}`'.format('`, `'.join(illegal_fvalue)))
        else:
            inf('OK: all corrected values are legal')
        
read_corr()

    40s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oBR_etcbc4b.csv
    40s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oFH_etcbc4b.csv
    40s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/oLH_etcbc4b.csv


    40s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/BRa_etcbc4b.csv
    40s BR>: Found     2 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/BRa_etcbc4b.csv
    40s OK: Corrected phrases did not receive multiple corrections
    40s OK: all corrected nodes where phrase nodes
    40s OK: all corrected values are legal
    40s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/BWa_etcbc4b.csv
    40s BW>: Found    57 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/BWa_etcbc4b.csv
    40s OK: Corrected phrases did not receive multiple corrections
    40s OK: all corrected nodes where phrase nodes
    40s OK: all corrected values are legal
    40s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/CJT_etcbc4b.csv
    40s CJT: Found    58 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/CJT_etcbc4b.csv
    40s OK: Corrected phrases did not receive multiple corrections
    40s OK: all corrected nodes where phrase nodes
    40s OK: all corrected values are legal
    40s Pr

    40s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/FJM_etcbc4b.csv


    40s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/HLK_etcbc4b.csv
    40s HLK: Found   156 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/HLK_etcbc4b.csv
    40s OK: Corrected phrases did not receive multiple corrections
    40s OK: all corrected nodes where phrase nodes
    40s OK: all corrected values are legal


    40s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/JRD_etcbc4b.csv
    40s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/JYa_etcbc4b.csv
    40s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/NPL_etcbc4b.csv
    40s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/NTN_etcbc4b.csv
    40s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/NWS_etcbc4b.csv


    40s Processing /Users/dirk/Dropbox/SYNVAR/corr_filled/QRa_etcbc4b.csv
    40s QR>: Found   159 corrections in /Users/dirk/Dropbox/SYNVAR/corr_filled/QRa_etcbc4b.csv
    40s OK: Corrected phrases did not receive multiple corrections
    40s OK: all corrected nodes where phrase nodes
    40s OK: all corrected values are legal


    40s NO file /Users/dirk/Dropbox/SYNVAR/corr_filled/SWR_etcbc4b.csv


# 5. Enrichment

We create blank sheets for new feature assignments, based on the corrected data.

In [12]:
enrich_field_spec = '''
valence
    adjunct
    complement
    core

grammatical
    *
    subject
    object
    copula
    copula+subject
    predication
    predication+subject
    predication+object

lexical
    location
    time

semantic
    benefactive
    time
    location
'''
enrich_fields = collections.OrderedDict()
cur_e = None
for line in enrich_field_spec.strip().split('\n'):
    if line.startswith(' '):
        enrich_fields.setdefault(cur_e, set()).add(line.strip())
    else:
        cur_e = line.strip()
if None in enrich_fields:
    msg('Invalid enrich field specification')
else:
    inf('Enrich field specification OK')
for ef in enrich_fields:
    print('{} = {{{}}}'.format(ef, ', '.join(sorted(enrich_fields[ef]))))

    44s Enrich field specification OK
valence = {adjunct, complement, core}
grammatical = {*, copula, copula+subject, object, predication, predication+object, predication+subject, subject}
lexical = {location, time}
semantic = {benefactive, location, time}


In [13]:
specs = '''
Adju	Adjunct	adjunct	NA		
Cmpl	Complement	complement	*		
Conj	Conjunction	NA	NA	NA	NA
EPPr	Enclitic personal pronoun	NA	copula		
ExsS	Existence with subject suffix	core	copula+subject		
Exst	Existence	core	copula		
Frnt	Fronted element	NA	NA	NA	NA
Intj	Interjection	NA	NA	NA	NA
IntS	Interjection with subject suffix	core	subject		
Loca	Locative	adjunct	NA	location	location
Modi	Modifier	NA	NA	NA	NA
ModS	Modifier with subject suffix	core	subject		
NCop	Negative copula	core	copula		
NCoS	Negative copula with subject suffix	core	copula+subject		
Nega	Negation	NA	NA	NA	NA
Objc	Object	complement	object		
PrAd	Predicative adjunct	adjunct	NA		
PrcS	Predicate complement with subject suffix	core	predication+subject		
PreC	Predicate complement	core	predication		
Pred	Predicate	core	predication		
PreO	Predicate with object suffix	core	predication+object		
PreS	Predicate with subject suffix	core	predication+subject		
PtcO	Participle with object suffix	core	predication+object		
Ques	Question	NA	NA	NA	NA
Rela	Relative	NA	NA	NA	NA
Subj	Subject	core	subject		
Supp	Supplementary constituent	adjunct	NA		benefactive
Time	Time reference	adjunct	NA	time	time
Unkn	Unknown	NA	NA	NA	NA
Voct	Vocative	NA	NA	NA	NA'''.strip().split('\n')

In [14]:
transform = {}
for line in specs:
    x = line.split('\t') 
    transform[x[0]] = dict(zip(enrich_fields, x[2:]))
for e in error_values:
    transform[e] = dict(zip(enrich_fields, ['NA']*4))

errors = 0
good = 0
for f in transform:
    for e in enrich_fields:
        val = transform[f][e]
        if val != '' and val != 'NA' and val not in enrich_fields[e]:
            msg('Defaults for `{}`: wrong `{}` value: "{}"'.format(f, e, val))
            errors += 1
        else: good += 1
if errors:
    msg('There were {} errors ({} good)'.format(errors, good))
else:
    inf('Defaults OK ({} good)'.format(good))

    48s Defaults OK (124 good)


In [15]:
ltpl = '{:<8}: {:<15} {:<20} {:<15} {:<15}'
print(ltpl.format('func', *enrich_fields))
for f in sorted(transform):
    sfs = transform[f]
    print(ltpl.format(f, *[sfs[sf] for sf in enrich_fields]))

func    : valence         grammatical          lexical         semantic       
Adju    : adjunct         NA                                                  
BoundErr: NA              NA                   NA              NA             
Cmpl    : complement      *                                                   
Conj    : NA              NA                   NA              NA             
EPPr    : NA              copula                                              
ExsS    : core            copula+subject                                      
Exst    : core            copula                                              
Frnt    : NA              NA                   NA              NA             
IntS    : core            subject                                             
Intj    : NA              NA                   NA              NA             
Loca    : adjunct         NA                   location        location       
ModS    : core            subject                   

## 5.1 Enrichment logic

For certain verbs and certain conditions, we can automatically fill in some of the new features.
For example, if the verb is `CJT`, and if an adjunct phrase is personal, starting with `L`, we know that the semantic role is *benefactive*.

In [16]:
locative_lexemes = set('''
>RY/ >YL/
<BR/ <BRH/ <BWR/ <C==/ <JR/ <L=/ <LJ=/ <LJH/ <LJL/ <MD=/ <MDH/ <MH/ <MQ/ <MQ===/ <QB/
BJT/
CM CMJM/ CMC/ C<R/
DRK/
FDH/
HR/
JM/ JRDN/ JRWCLM/ JFR>L/
MDBR/ MW<D/ MWL/ MZBX/ MYRJM/ MQWM/ MR>CWT/ MSB/ MSBH/ MVH==/
QDM/
SBJB/
TJMN/ TXT/ TXWT/
YPWN/
'''.strip().split())

personal_lexemes = set('''
>B/ >CH/ >DM/ >DRGZR/ >DWN/ >JC/ >J=/ >KR/ >LJL/ >LMN=/ >LMNH/ >LMNJ/ >LWH/ >LWP/ >M/ 
>MH/ >MN==/ >MWN=/ >NC/ >NWC/ >PH/ >PRX/ >SJR/ >SJR=/ >SP/ >X/ >XCDRPN/
>XWH/ >XWT/
<BDH=/ <CWQ/ <D=/ <DH=/ <LMH/ <LWMJM/ <M/ <MD/ <MJT/ <QR=/ <R/ <WJL/ <WL/ <WL==/ <WLL/
<WLL=/ <YRH/
B<L/ B<LH/ BKJRH/ BKR/ BN/ BR/ BR===/ BT/ BTWLH/ BWQR/ BXRJM/ BXWN/ BXWR/
CD==/ CDH/ CGL/ CKN/ CLCJM/ CLJC=/ CMRH=/ CPXH/ CW<R/ CWRR/
DJG/ DWD/ DWDH/ DWG/ DWR/
F<JR=/ FB/ FHD/ FR/ FRH/ FRJD/ FVN/
GBJRH/ GBR/ GBR=/ GBRT/ GLB/ GNB/ GR/ GW==/ GWJ/ GZBR/
HDBR/ 
J<RH/ JBM/ JBMH/ JD<NJ/ JDDWT/ JLD/ JLDH/ JLJD/ JRJB/ JSWR/ JTWM/ JWYR/
JYRJM/ 
KCP=/ KHN/ KLH/ KMR/ KN<NJ=/ KNT/ KRM=/ KRWB/ KRWZ/
L>M/ LHQH/ LMD/ LXNH/
M<RMJM/ M>WRH/ MCBR/ MCJX/ MCM<T/ MCMR/ MCPXH/ MCQLT/ MD<=/ MD<T/ MG/
MJNQT/ MKR=/ ML>K/ MLK/ MLKH/ MLKT/ MLX=/ MLYR/ MMZR/ MNZRJM/ MPLYT/
MPY=/ MQHL/ MQY<H/ MR</ MR>/ MSGR=/ MT/ MWRH/ MYBH=/
N<R/ N<R=/ N<RH/ N<RWT/ N<WRJM/ NBJ>/ NBJ>H/ NCJN/ NFJ>/ NGJD/ NJN/ NKD/ 
NKR/ NPC/ NPJLJM/ NQD/ NSJK/ NTJN/ 
PLGC/ PLJL/ PLJV/ PLJV=/ PQJD/ PR<H/ PRC/ PRJY/ PRJY=/ PRTMJM/ PRZWN/ 
PSJL/ PSL/ PVR/ PVRH/ PXH/ PXR/
QBYH/ QCRJM/ QCT=/ QHL/ QHLH/ QHLT/ QJM/ QYJN/
R<H=/ R<H==/ R<JH/ R<=/ R<WT/ R>H/ RB</ RB=/ RB==/ RBRBNJN/ RGMH/ RHB/ RKB=/
RKJL/ RMH/ RQX==/ 
SBL/ SPR=/ SRJS/ SRK/ SRNJM/ 
T<RWBWT/ TLMJD/ TLT=/ TPTJ/ TR<=/ TRCT>/ TRTN/ TWCB/ TWL<H/ TWLDWT/ TWTX/
VBX/ VBX=/ VBXH=/ VPSR/ VPXJM/
WLD/
XBL==/ XBL======/ XBR/ XBR=/ XBR==/ XBRH/ XBRT=/ XJ=/ XLC/ XM=/ XMWT/
XMWY=/ XNJK/ XR=/ XRC/ XRC====/ XRP=/ XRVM/ XTN/ XTP/ XZH=/
Y<JRH/ Y>Y>JM/ YJ/ YJD==/ YJR==/ YR=/ YRH=/ 
ZKWR/ ZMR=/ ZR</
'''.strip().split())

In [39]:
def has_L(vl, pn):
    words = L.d('word', pn)
    return len(words) > 0 and F.lex.v(words[0] == 'L')

def is_lex_personal(vl, pn):
    words = L.d('word', pn)
    return len(words) > 1 and F.lex.v(words[1] in personal_lexemes)

def is_lex_local(vl, pn):
    words = L.d('word', pn)
    return len({F.lex.v(w) for w in words} & locative_lexemes) > 0

def has_H_locale(vl, pn):
    words = L.d('word', pn)
    return len({w for w in words if F.uvf.v(w) == 'H'}) > 0  

In [40]:
enrich_logic = {
    'CJT': [
        (('semantic', 'benefactive'), ('function:Adju', has_L, is_lex_personal)),
        (('lexical', 'location'), ('function:Cmpl', has_H_locale)),
        (('lexical', 'location'), ('function:Cmpl', is_lex_local)),
    ],    
}

In [41]:
def rule_as_str(vl, i, sf, sfval, conditions):
    return '{}-{} {:<10} => {:<15} if {}'.format(
                    vl, i+1, sf, sfval,
                    ' AND '.join(
                        '{:<10} = {:<8}'.format(
                                *c.split(':')
                            ) if type(c) is str else '{:<15}'.format(
                                c.__name__
                            ) for c in conditions,
                    ),
    )

def check_logic():
    errors = 0
    rules = 0
    for vl in sorted(enrich_logic):
        for (i, ((sf, sfval), conditions)) in enumerate(enrich_logic[vl]):
            rules += 1
            inf(rule_as_str(vl, i, sf, sfval, conditions), withtime=False)
            if sf not in enrich_fields:
                msg('"{}" not a valid enrich field'.format(sf), withtime=False)
                errors += 1
            elif sfval not in enrich_fields[sf]:
                msg('`{}`: "{}" not a valid enrich field value'.format(sf, sfval), withtime=False)
                errors += 1
            for c in conditions:
                if type(c) == str:
                    x = c.split(':')
                    if len(x) != 2:
                        msg('Wrong feature condition {}'.format(c))
                        errors += 1
                    else:
                        (feat, val) = x
                        if feat not in legal_values:
                            msg('Feature `{}` not in use'.format(feat))
                            errors += 1
                        elif val not in legal_values[feat]:
                            msg('Feature `{}`: not a valid value "{}"'.format(feat, val))
                            errors += 1
    if errors:
        msg('There were {} errors in {} rules'.format(errors, rules))
    else:
        inf('All {} rules OK'.format(rules))

check_logic()

CJT-1 semantic   => benefactive     if function   = Adju     AND has_L           AND is_lex_personal
CJT-2 lexical    => location        if function   = Cmpl     AND has_H_locale   
CJT-3 lexical    => location        if function   = Cmpl     AND is_lex_local   
15m 42s All 3 rules OK


In [42]:
applied_cases = {}

def apply_logic(vl, pn, init_values):
    values = deepcopy(init_values)
    verb_rules = enrich_logic.get(vl, [])
    for (i, ((sf, sfval), conditions)) in enumerate(verb_rules):
        ok = True
        for condition in conditions:
            if type(condition) is str:
                (feature, value) = condition.split(':')
                this_ok = F.item[feature].v(pn) == value
            else:
                this_ok = condition(vl, pn)
            if not this_ok:
                ok = False
                break
        if ok:
            values[sf] = sfval
            applied_cases.setdefault(rule_as_str(vl, i, sf, sfval, conditions), []).append(pn)
    return tuple(values[sf] for sf in enrich_fields)

In [43]:
COMMON_FIELDS = '''
    cnode#
    vnode#
    pnode#
    book
    chapter
    verse
    link
    verb_lexeme
    verb_stem
    verb_occurrence
'''.strip().split()

CLAUSE_FIELDS = '''
    clause_text    
'''.strip().split()

PHRASE_FIELDS = '''
    phrase_text
    function
'''.strip().split() + list(enrich_fields)

field_names = []
for f in COMMON_FIELDS: field_names.append(f)
for i in range(max((len(CLAUSE_FIELDS), len(PHRASE_FIELDS)))):
    pf = PHRASE_FIELDS[i] if i < len(PHRASE_FIELDS) else '--'
    field_names.append(pf)
    
fillrows = len(CLAUSE_FIELDS) - len(PHRASE_FIELDS)
cfillrows = 0 if fillrows >= 0 else -fillrows
pfillrows = fillrows if fillrows >= 0 else 0
print('\n'.join(field_names))    

cnode#
vnode#
pnode#
book
chapter
verse
link
verb_lexeme
verb_stem
verb_occurrence
phrase_text
function
valence
grammatical
lexical
semantic


In [53]:
def gen_sheet_enrich(verb):
    rows = []
    fieldsep = ';'
    clauses_seen = set()
    for wn in occs[verb]:
        cl = L.u('clause', wn)
        if cl in clauses_seen: continue
        clauses_seen.add(cl)
        cn = L.u('clause', wn)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        book = T.book_name(bn, lang='en')
        chapter = F.chapter.v(vn)
        verse = F.verse.v(vn)
        ln = ln_base+(ln_tpl.format(T.book_name(bn, lang='la'), chapter, verse))+ln_tweak
        lnx = '''"=HYPERLINK(""{}""; ""link"")"'''.format(ln)
        vl = F.lex.v(wn).rstrip('[=')
        vstem = F.vs.v(wn)
        vt = T.words([wn], fmt='ec').replace('\n', '')
        ct = T.words(L.d('word', cn), fmt='ec').replace('\n', '')
        
        common_fields = (cn, wn, -1, book, chapter, verse, lnx, vl, vstem, vt)
        clause_fields = (ct,)
        rows.append(common_fields + clause_fields + (('',)*cfillrows))
        for pn in L.d('phrase', cn):
            common_fields = (cn, wn, pn, book, chapter, verse, vl, vt)
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            pf = pf_corr.get(pn, None) or F.function.v(pn)
            phrase_fields = (pt, pf) + apply_logic(vl, pn, transform[pf])            
            rows.append(common_fields + phrase_fields + (('',)*pfillrows))
    filename = vfile(verb, 'enrich_blank')
    row_file = open(filename, 'w')
    row_file.write('{}\n'.format(fieldsep.join(field_names)))
    for row in rows:
        row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    row_file.close()
    msg('Generated enrichment sheet for verb {} ({} rows)'.format(filename, len(rows)))
    
for verb in verbs: gen_sheet_enrich(verb)

msg('Done')
print('{} rules applied'.format(len(applied_cases)))
totaln = 0
for rule in applied_cases:
    cases = applied_cases[rule]
    n = len(cases)
    totaln += n
    print('{}\n\t{:>4} phrases: {}'.format(rule, n, ', '.join(str(c) for c in cases[0:10])))
print('{} applications in total'.format(totaln))

39m 25s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/oFH_etcbc4b.csv (11048 rows)
39m 26s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/oBR_etcbc4b.csv (2279 rows)
39m 26s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/HLK_etcbc4b.csv (5672 rows)
39m 26s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/NPL_etcbc4b.csv (1915 rows)
39m 26s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/NWS_etcbc4b.csv (613 rows)
39m 27s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/BWa_etcbc4b.csv (10817 rows)
39m 27s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/CJT_etcbc4b.csv (375 rows)
39m 27s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/CWB_etcbc4b.csv (4161 rows)
39m 27s Generated enrichment sheet for verb /Users/dirk/Dropbox/SYNVAR/enrich_blank/JYa_etcbc4b.csv (449

2 rules applied
CJT-1 semantic   => benefactive     if function   = Adju     AND has_L           AND is_lex_personal
	  10 phrases: 615130, 630648, 712015, 794512, 797440, 615130, 630648, 712015, 794512, 797440
CJT-3 lexical    => location        if function   = Cmpl     AND is_lex_local   
	  14 phrases: 606396, 619338, 630956, 654145, 776266, 789542, 797377, 606396, 619338, 630956
24 applications in total


In [59]:
def check_h(vl, show_results=False):
    hl = {}
    total = 0
    for w in F.otype.s('word'):
        if F.sp.v(w) != 'verb' or F.lex.v(w).rstrip('[=/') != vl: continue
        total += 1
        c = L.u('clause', w)
        ps = L.d('phrase', c)
        phs = {p for p in ps if len({w for w in L.d('word', p) if F.uvf.v(w) == 'H'}) > 0}
        for f in ('Cmpl', 'Adju', 'Loca'):
            phc = {p for p in ps if pf_corr.get(p, None) or F.function.v(p) == f}
            if len(phc & phs): hl.setdefault(f, set()).add(w)
    for f in hl:
        print('Verb {}: {} occurrences. He locales in {} phrases: {}'.format(vl, total, f, len(hl[f])))
        if show_results: print('\t{}'.format(', '.join(str(x) for x in hl[f])))
check_h('BW>', show_results=True)        

Verb BW>: 2570 occurrences. He locales in Adju phrases: 4
	154354, 322818, 154964, 75702
Verb BW>: 2570 occurrences. He locales in Loca phrases: 14
	90243, 93571, 29637, 284965, 289859, 136745, 257293, 289871, 154354, 154964, 9525, 257016, 284989, 93598
Verb BW>: 2570 occurrences. He locales in Cmpl phrases: 157
	26118, 26127, 146447, 187920, 197138, 272406, 95257, 184350, 398368, 289826, 201253, 24616, 78897, 401459, 100410, 32829, 100413, 198208, 5698, 200258, 100938, 24653, 141902, 112207, 186960, 24658, 196690, 28764, 34400, 298594, 248931, 132198, 162918, 12402, 5747, 146044, 396927, 153216, 134792, 151176, 188042, 97419, 426120, 257165, 136338, 21656, 162970, 200349, 214687, 24740, 257192, 158378, 100527, 25777, 160434, 214707, 4789, 4793, 272569, 139963, 90812, 249020, 38595, 113861, 138448, 8920, 282841, 19166, 20703, 26850, 43235, 145127, 8424, 8937, 170729, 397032, 254703, 154354, 200948, 426230, 176376, 79609, 165626, 206075, 208636, 27391, 269569, 106246, 157447, 26380, 149

## 5.1 Process the enrichments

We read the enrichments, perform some consistency checks, and produce an annotation package.

In [13]:
pf_enriched = set()
repeated = collections.defaultdict(list)
non_phrase = set()

def read_enrich(rootdir):
    results = []
    for verb in sorted(verbs):
        filename = '{}/{}'.format(rootdir, vfile(verb, 'enriched'))
        if not os.path.exists(filename):
            print('NO file {}'.format(filename))
            continue
        with open(filename) as f:
            header = f.__next__()
            for line in f:
                fields = line.rstrip().split(';')
                pn = int(fields[2])
                if pn < 0: continue
                vvals = tuple(fields[-4:])
                results.append((pn,)+vvals)
                if pn in pf_enriched:
                    repeated[pn] += vvals
                else:
                    pf_enriched.add(pn)
                if F.otype.v(pn) != 'phrase': 
                    non_phrase.add(pn)

        print('{}: Found {:>5} enrichments in {}'.format(verb, len(results), filename))
    if len(repeated):
        msg('ERROR: Some phrases have been enriched multiple times!')
        for x in sorted(repeated):
            print('{:>6}: {}'.format(x, ', '.join(repeated[x])))
    else:
        msg('OK: Enriched phrases did not receive multiple enrichments')
    if len(non_phrase):
        msg('ERROR: Enrichments have been applied to non-phrase nodes: {}'.format(','.join(non_phrase)))
    else:
        msg('OK: all enriched nodes where phrase nodes')
    print(results[0:10])
    return results

corr = ExtraData(API)
corr.deliver_annots(
    'complements', 
    {'title': 'Verb complement enrichments', 'date': '2016-03'},
    [
        ('cpl', 'complements', read_enrich, tuple(
            ('JanetDyk', 'ft', fname) for fname in enrich_fields
        ))
    ],
)

    30s OK: Enriched phrases did not receive multiple enrichments
    30s OK: all enriched nodes where phrase nodes


NO file /Users/dirk/surfdrive/laf-fabric-data/cpl/BRa_enriched_etcbc4b.csv
CJT: Found   291 enrichments in /Users/dirk/surfdrive/laf-fabric-data/cpl/CJT_enriched_etcbc4b.csv
NO file /Users/dirk/surfdrive/laf-fabric-data/cpl/QRa_enriched_etcbc4b.csv
[(605977, 'NA', 'NA', 'NA', 'NA'), (605978, 'complement', 'object', '', ''), (605979, 'core', 'predication', '', ''), (605980, 'complement', '*', '', ''), (606391, 'NA', 'NA', 'NA', 'NA'), (606392, 'core', 'predication', '', ''), (606393, 'adjunct', 'NA', '', ''), (606394, 'core', 'subject', '', ''), (606395, 'complement', 'object', '', ''), (606396, 'complement', '*', '', '')]


# 6 Annox complements
We load the s into the LAF-Fabric API, in the process of which they will be compiled.

Note that we draw in the new annotations by specifying an *annox* called `complements` (the second argument of the `fabric.load` function).

Then we turn that data into LAF annotations. Every enrichment is stored in new features, 
with names specified above in ``enrich_fields``, 
with label `ft` and namespace `JanetDyk`.

In [14]:
API=fabric.load(source+version, 'complements', 'flow_corr', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        oid otype
        sp vs lex
        function
        chapter verse
        function
    ''' + ' '.join(enrich_fields),
    '''
    '''),
    "prepare": prepare,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: UP TO DATE
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s BEGIN COMPILE a: complements
  0.00s DETAIL: load main: X. [node]  -> 
  1.44s DETAIL: load main: X. [e]  -> 
  4.00s DETAIL: load main: G.node_anchor_min
  4.11s DETAIL: load main: G.node_anchor_max
  4.22s DETAIL: load main: G.node_sort
  4.33s DETAIL: load main: G.node_sort_inv
  4.90s DETAIL: load main: G.edges_from
  5.03s DETAIL: load main: G.edges_to
  5.17s LOGFILE=/Users/dirk/surfdrive/laf-fabric-data/etcbc4b/bin/A/complements/__log__compile__.txt
  5.17s PARSING ANNOTATION FILES
  5.21s INFO: parsing complements.xml
  5.23s INFO: END PARSING
         0 good   regions  and     0 faulty ones
         0 linked nodes    and     0 unlinked ones
         0 good   edges    and     0 faulty ones
       291 good   annots   and     0 faulty ones
      1164 good   features and     0 faulty ones
       291 distinct xml identifiers

  5.23s MODELI