<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="right" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="right" src="images/etcbc4easy-small.png"/></a>

# Complement corrections


# 0. Introduction

Joint work of Dirk Roorda and Janet Dyk.

In order to do
[flowchart analysis](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/flowchart.html)
on verbs, we need to correct some coding errors.

Because the flowchart assigns meanings to verbs depending on the number and nature of complements found in their context, it is important that the phrases in those clauses are labeled correctly, i.e. that the
[function](https://shebanq.ancient-data.org/shebanq/static/docs/featuredoc/features/comments/function.html)
feature for those phrases has the correct label.

# 1. Task
In this notebook we do the following tasks:

* generate correction sheets for selected verbs;
* transform the set of filled-in correction sheets into an annotation package.

Between the first and second task, the sheets will have been filled in by Janet with corrections.

The resulting annotation package offers the corrections as the value of a new feature, also called `function`, but now in the annotation space `JanetDyk` instead of `etcbc4`.

# 1. Implementation

Start the engines, and note the import of the `ExtraData` functionality from the `etcbc.extra` module.
This module can turn data with anchors into additional LAF annotations to the big ETCBC LAF resource.

In [1]:
import sys,os
import collections

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.extra import ExtraData

fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.23
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [2]:
source = 'etcbc'
version = '4b'

We instruct the API to load data.
Note that we ask for the XML identifiers, because `ExtraData` needs them to stitch the corrections into the LAF XML.

In [3]:
API = fabric.load(source+version, '--', 'flow_corr', {
    "xmlids": {"node": True, "edge": False},
    "features": ('''
        oid otype
        sp vs lex
        function
        chapter verse
    ''',''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  5.65s LOGFILE=/Users/dirk/Local/laf-fabric-output/etcbc4b/flow_corr/__log__flow_corr.txt
  5.65s INFO: LOADING PREPARED data: please wait ... 
  5.65s prep prep: G.node_sort
  5.76s prep prep: G.node_sort_inv
  6.29s prep prep: L.node_up
  9.61s prep prep: L.node_down
    15s prep prep: V.verses
    15s prep prep: V.books_la
    15s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
    17s INFO: LOADED PREPARED data
    17s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK flow_corr AT 2016-03-23T12-11-53


# 1.1 Domain
Here is the set of verbs that interest us.

In [45]:
verbs = set('''
    CJT
    BR>
    QR>
'''.strip().split())

motion_verbs = set('''
    <BR
    <LH
    BW>
    CWB
    HLK
    JRD
    JY>
    NPL
    NWS
    SWR
'''.strip().split())

verbs |= motion_verbs

# 1.2 Phrase function

We need to correct some values of the phrase function.
When we receive the corrections, we check whether they have legal values.
Here we look up the possible values.

In [46]:
function_values = {F.function.v(p) for p in F.otype.s('phrase')}

We need some extra values to indicate other coding errors.
After that, we generate a list of occurrences of those verbs, organized by the lexeme of the verb.

In [47]:
error_values = dict(
    BoundErr='this phrase is part of another phrase and does not merit its own function value',
)

function_values |= set(error_values)
function_values

{'Adju',
 'BoundErr',
 'Cmpl',
 'Conj',
 'EPPr',
 'ExsS',
 'Exst',
 'Frnt',
 'IntS',
 'Intj',
 'Loca',
 'ModS',
 'Modi',
 'NCoS',
 'NCop',
 'Nega',
 'Objc',
 'PrAd',
 'PrcS',
 'PreC',
 'PreO',
 'PreS',
 'Pred',
 'PtcO',
 'Ques',
 'Rela',
 'Subj',
 'Supp',
 'Time',
 'Voct'}
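The legality check described above can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: it uses a hand-copied subset of the values listed above instead of the loaded `F.function` feature, and the helper name `check_corrections`, the node numbers and the sample corrections are hypothetical.

```python
# Subset of the function values listed above, plus the extra error value.
legal_values = {'Adju', 'Cmpl', 'Loca', 'Objc', 'Pred', 'Subj', 'Time', 'BoundErr'}

def check_corrections(corrections, legal):
    """Split corrections into accepted ones and the set of illegal values."""
    accepted = {}
    illegal = set()
    for phrase_node, value in corrections.items():
        value = value.strip()          # tolerate stray spaces typed into the sheet
        if value in legal:
            accepted[phrase_node] = value
        else:
            illegal.add(value)
    return accepted, illegal

# Hypothetical phrase nodes and corrections, for illustration only.
sample = {1001: 'Loca', 1002: ' Cmpl ', 1003: 'spec'}
accepted, illegal = check_corrections(sample, legal_values)
print(accepted)
print(illegal)
```

Collecting the illegal values in a set, rather than stopping at the first one, lets all problems in a sheet be reported in a single pass.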

In [48]:
msg('Finding occurrences')
occs = collections.defaultdict(list)
for n in F.otype.s('word'):
    lex = F.lex.v(n)
    if lex.endswith('['):
        lex = lex[0:-1]
        occs[lex].append(n)
msg('Done')
for verb in sorted(verbs):
    print('{} {:<5} occurrences'.format(verb, len(occs[verb])))

 1h 00m 59s Finding occurrences
 1h 01m 01s Done


<BR 548   occurrences
<LH 890   occurrences
BR> 48    occurrences
BW> 2570  occurrences
CJT 85    occurrences
CWB 1037  occurrences
HLK 1554  occurrences
JRD 377   occurrences
JY> 1069  occurrences
NPL 445   occurrences
NWS 159   occurrences
QR> 743   occurrences
SWR 297   occurrences


# 1.2 Blank sheet generation
Generate correction sheets.
They are CSV files. Every row corresponds to a clause that contains an occurrence of the verb.
The fields per row are the node number of the clause in which the verb occurs, the node number of the verb occurrence, a passage label (book, chapter, verse), a link to the verse in SHEBANQ, the text of the verb occurrence (in ETCBC transliteration, consonantal), the verbal stem, and then 4 columns for each phrase in the clause:

* phrase node number
* phrase text (ETCBC translit consonantal)
* original value of the `function` feature
* corrected value of the `function` feature (generated as empty)

In [49]:
def vfile(verb, kind): return '{}_{}_{}{}.csv'.format(verb.replace('>','a').replace('<', 'o'), kind, source, version)

ln_base = 'https://shebanq.ancient-data.org/hebrew/text'
ln_tpl = '?book={}&chapter={}&verse={}'
ln_tweak = '&version=4b&mr=m&qw=n&tp=txt_tb1&tr=hb&wget=x&qget=v&nget=x'

def gen_sheet(verb):
    rows = []
    fieldsep = ';'
    field_names = '''
        clause#
        word#
        passage
        link
        verb
        stem
    '''.strip().split()
    max_phrases = 0
    clauses_seen = set()
    for wn in occs[verb]:
        cln = L.u('clause', wn)
        if cln in clauses_seen: continue
        clauses_seen.add(cln)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        ch = F.chapter.v(vn)
        vs = F.verse.v(vn)
        passage_label = '{} {}:{}'.format(T.book_name(bn, lang='en'), ch, vs)
        ln = ln_base+(ln_tpl.format(T.book_name(bn, lang='la'), ch, vs))+ln_tweak
        lnx = '''"=HYPERLINK(""{}""; ""link"")"'''.format(ln)
        vt = T.words([wn], fmt='ec').replace('\n', '')
        vstem = F.vs.v(wn)
        row = [cln, wn, passage_label, lnx, vt, vstem]
        phrases = L.d('phrase', cln)
        n_phrases = len(phrases)
        if n_phrases > max_phrases: max_phrases = n_phrases
        for pn in phrases:
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            pf = F.function.v(pn)
            row.extend((pn, pt, pf, ''))
        rows.append(row)
    for i in range(max_phrases):
        field_names.extend('''
            phr{i}#
            phr{i}_txt
            phr{i}_function
            phr{i}_corr
        '''.format(i=i+1).strip().split())
    filename = vfile(verb, 'corr')
    row_file = outfile(filename)
    row_file.write('{}\n'.format(fieldsep.join(field_names)))
    for row in rows:
        row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    row_file.close()
    msg('Generated correction sheet for verb {}'.format(filename))
    
for verb in verbs: gen_sheet(verb)    

 1h 01m 07s Generated correction sheet for verb HLK_corr_etcbc4b.csv
 1h 01m 07s Generated correction sheet for verb CWB_corr_etcbc4b.csv
 1h 01m 07s Generated correction sheet for verb NWS_corr_etcbc4b.csv
 1h 01m 07s Generated correction sheet for verb JRD_corr_etcbc4b.csv
 1h 01m 07s Generated correction sheet for verb JYa_corr_etcbc4b.csv
 1h 01m 07s Generated correction sheet for verb oBR_corr_etcbc4b.csv
 1h 01m 07s Generated correction sheet for verb QRa_corr_etcbc4b.csv
 1h 01m 07s Generated correction sheet for verb CJT_corr_etcbc4b.csv
 1h 01m 08s Generated correction sheet for verb NPL_corr_etcbc4b.csv
 1h 01m 08s Generated correction sheet for verb BRa_corr_etcbc4b.csv
 1h 01m 08s Generated correction sheet for verb SWR_corr_etcbc4b.csv
 1h 01m 08s Generated correction sheet for verb oLH_corr_etcbc4b.csv
 1h 01m 08s Generated correction sheet for verb BWa_corr_etcbc4b.csv


# 1.3 Processing corrections
We read the filled-in correction sheets and extract the correction data from them.
We store the corrections in a dictionary keyed by the phrase node.
We check whether we get multiple corrections for the same phrase.

In [34]:
pf_corr = {}
repeated = collections.defaultdict(list)
non_phrase = set()
illegal_fvalue = set()

annox_basedir = API['data_dir']
annox_subdir = 'cpl'
annox_dir = '{}/{}'.format(annox_basedir, annox_subdir)

def read_corr():
    for verb in sorted(verbs):
        filename = '{}/{}'.format(annox_dir, vfile(verb, 'corrected'))
        if not os.path.exists(filename):
            print('NO file {}'.format(filename))
            continue
        with open(filename) as f:
            header = f.__next__()
            for line in f:
                fields = line.rstrip().split(';')
                for i in range(1, len(fields)//4):
                    (pn, pc) = (fields[2+4*i], fields[2+4*i+3])
                    if pn != '':
                        pc = pc.strip()
                        pn = int(pn)
                        if pc != '':
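                            good = True
                            # one-pass loop: `continue` acts as an early exit from the checks
                            for _ in [1]:
                                good = False
                                if pn in pf_corr:
                                    # append the whole value; `+=` would add it character by character
                                    repeated[pn].append(pc)
                                    continue
                                if pc not in function_values:
                                    illegal_fvalue.add(pc)
                                    continue
                                if F.otype.v(pn) != 'phrase': 
                                    non_phrase.add(pn)
                                    continue
                                good = True
                            if good:
                                pf_corr[pn] = pc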

        print('{}: Found {:>5} corrections in {}'.format(verb, len(pf_corr), filename))
    if len(repeated):
        msg('ERROR: Some phrases have been corrected multiple times!')
        for x in sorted(repeated):
            print('{:>6}: {}'.format(x, ', '.join(repeated[x])))
    else:
        msg('OK: Corrected phrases did not receive multiple corrections')
    if len(non_phrase):
        msg('ERROR: Corrections have been applied to non-phrase nodes: {}'.format(','.join(str(n) for n in sorted(non_phrase))))
    else:
        msg('OK: all corrected nodes where phrase nodes')
    if len(illegal_fvalue):
        msg('ERROR: Some corrections supply illegal values for phrase function!')
        print(illegal_fvalue)
    else:
        msg('OK: all corrected values are legal')
        
read_corr()

39m 49s OK: Corrected phrases did not receive multiple corrections
39m 49s OK: all corrected nodes where phrase nodes
39m 49s ERROR: Some corrections supply illegal values for phrase function!


BR>: Found     2 corrections in /Users/dirk/surfdrive/laf-fabric-data/cpl/BRa_corrected_etcbc4b.csv
CJT: Found     3 corrections in /Users/dirk/surfdrive/laf-fabric-data/cpl/CJT_corrected_etcbc4b.csv
QR>: Found     5 corrections in /Users/dirk/surfdrive/laf-fabric-data/cpl/QRa_corrected_etcbc4b.csv
{'spec'}
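How a filled-in row yields corrections can be shown on a single made-up line. The sketch assumes the layout of the correction sheets (six fixed columns, then four columns per phrase) and assumes that the spreadsheet export has reduced the link column to plain text without a semicolon; the function name `corrections_in_row` and the sample line are hypothetical.

```python
N_FIXED = 6      # clause#, word#, passage, link, verb, stem
PER_PHRASE = 4   # node, text, original function, corrected function

def corrections_in_row(line, sep=';'):
    """Yield (phrase_node, corrected_value) for each filled-in correction."""
    fields = line.rstrip('\n').split(sep)
    n_phrases = (len(fields) - N_FIXED) // PER_PHRASE
    for i in range(n_phrases):
        start = N_FIXED + PER_PHRASE * i
        node, corr = fields[start], fields[start + 3].strip()
        # Rows are padded with empty cells; only filled-in corrections count.
        if node != '' and corr != '':
            yield int(node), corr

# Hypothetical row: the second phrase has been corrected from Pred to Cmpl.
line = '500;10;Genesis 1:1;link;BR>;qal;600;B R>CJT;Time;;601;BR>;Pred;Cmpl\n'
print(list(corrections_in_row(line)))
```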


# 3. Enrichment

We create blank sheets for new feature assignments, based on the corrected data.

In [24]:
specs = '''function	description	valence	grammatical	lexical	semantic
Adju	Adjunct	adjunct	NA		
Cmpl	Complement	complement	*		
Conj	Conjunction	NA	NA	NA	NA
EPPr	Enclitic personal pronoun	NA	copula		
ExsS	Existence with subject suffix	core	copula+subject		
Exst	Existence	core	copula		
Frnt	Fronted element	NA	NA	NA	NA
Intj	Interjection	NA	NA	NA	NA
IntS	Interjection with subject suffix	core	subject		
Loca	Locative	adjunct	NA	location	location
Modi	Modifier	NA	NA	NA	NA
ModS	Modifier with subject suffix	core	subject		
NCop	Negative copula	core	copula		
NCoS	Negative copula with subject suffix	core	copula+subject		
Nega	Negation	NA	NA	NA	NA
Objc	Object	complement	object		
PrAd	Predicative adjunct	adjunct	NA		
PrcS	Predicate complement with subject suffix	core	predication+subject		
PreC	Predicate complement	core	predication		
Pred	Predicate	core	predication		
PreO	Predicate with object suffix	core	predication+object		
PreS	Predicate with subject suffix	core	predication+subject		
PtcO	Participle with object suffix	core	predication+object		
Ques	Question	NA	NA	NA	NA
Rela	Relative	NA	NA	NA	NA
Subj	Subject	core	subject		
Supp	Supplementary constituent	adjunct	NA		benefactive
Time	Time reference	adjunct	NA	time	time
Unkn	Unknown	NA	NA	NA	NA
Voct	Vocative	NA	NA	NA	NA'''.split('\n')

In [39]:
specfields = specs[0].split('\t')[2:]
transform = dict((x[0], tuple(x[2:])) for x in (y.split('\t') for y in specs[1:]))
for e in error_values:
    transform[e] = ('NA',)*4

In [41]:
print('{}\n{}'.format(specfields, '\n'.join('{}: {}'.format(*x) for x in sorted(transform.items()))))

['valence', 'grammatical', 'lexical', 'semantic']
Adju: ('adjunct', 'NA', '', '')
BoundErr: ('NA', 'NA', 'NA', 'NA')
Cmpl: ('complement', '*', '', '')
Conj: ('NA', 'NA', 'NA', 'NA')
EPPr: ('NA', 'copula', '', '')
ExsS: ('core', 'copula+subject', '', '')
Exst: ('core', 'copula', '', '')
Frnt: ('NA', 'NA', 'NA', 'NA')
IntS: ('core', 'subject', '', '')
Intj: ('NA', 'NA', 'NA', 'NA')
Loca: ('adjunct', 'NA', 'location', 'location')
ModS: ('core', 'subject', '', '')
Modi: ('NA', 'NA', 'NA', 'NA')
NCoS: ('core', 'copula+subject', '', '')
NCop: ('core', 'copula', '', '')
Nega: ('NA', 'NA', 'NA', 'NA')
Objc: ('complement', 'object', '', '')
PrAd: ('adjunct', 'NA', '', '')
PrcS: ('core', 'predication+subject', '', '')
PreC: ('core', 'predication', '', '')
PreO: ('core', 'predication+object', '', '')
PreS: ('core', 'predication+subject', '', '')
Pred: ('core', 'predication', '', '')
PtcO: ('core', 'predication+object', '', '')
Ques: ('NA', 'NA', 'NA', 'NA')
Rela: ('NA', 'NA', 'NA', 'NA')
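Subj: ('core', 'subject', '', '')
Supp: ('adjunct', 'NA', '', 'benefactive')
Time: ('adjunct', 'NA', 'time', 'time')
Unkn: ('NA', 'NA', 'NA', 'NA')
Voct: ('NA', 'NA', 'NA', 'NA')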
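The way the tab-separated specification becomes a lookup table can be checked on a two-row excerpt. The names `spec_lines`, `spec_fields` and `transform_demo` are chosen for this sketch only; the rows are copied from the specification above.

```python
# Header plus two rows of the specification, tab-separated as in the notebook.
spec_lines = (
    'function\tdescription\tvalence\tgrammatical\tlexical\tsemantic\n'
    'Loca\tLocative\tadjunct\tNA\tlocation\tlocation\n'
    'Objc\tObject\tcomplement\tobject\t\t'
).split('\n')

# Columns 0 and 1 identify and describe the function; the rest are defaults.
spec_fields = spec_lines[0].split('\t')[2:]
transform_demo = {
    cells[0]: tuple(cells[2:])
    for cells in (line.split('\t') for line in spec_lines[1:])
}
print(spec_fields)
print(transform_demo['Objc'])
```

Trailing tabs matter here: they produce the empty strings that keep every tuple at four values.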

In [50]:
COMMON_FIELDS = '''
    cnode#
    vnode#
    pnode#
    book
    chapter
    verse
    link
    verb_lexeme
    verb_stem
    verb_occurrence
'''.strip().split()

CLAUSE_FIELDS = '''
    clause_text    
'''.strip().split()

PHRASE_FIELDS = '''
    phrase_text
    function
'''.strip().split() + specfields

field_names = []
for f in COMMON_FIELDS: field_names.append(f)
for i in range(max((len(CLAUSE_FIELDS), len(PHRASE_FIELDS)))):
    pf = PHRASE_FIELDS[i] if i < len(PHRASE_FIELDS) else '--'
    field_names.append(pf)
    
fillrows = len(CLAUSE_FIELDS) - len(PHRASE_FIELDS)
cfillrows = 0 if fillrows >= 0 else -fillrows
pfillrows = fillrows if fillrows >= 0 else 0
print('\n'.join(field_names))    

cnode#
vnode#
pnode#
book
chapter
verse
link
verb_lexeme
verb_stem
verb_occurrence
phrase_text
function
valence
grammatical
lexical
semantic


In [43]:
def gen_sheet_enrich(verb):
    rows = []
    fieldsep = ';'
    clauses_seen = set()
    for wn in occs[verb]:
        cn = L.u('clause', wn)
        if cn in clauses_seen: continue
        clauses_seen.add(cn)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        book = T.book_name(bn, lang='en')
        chapter = F.chapter.v(vn)
        verse = F.verse.v(vn)
        ln = ln_base+(ln_tpl.format(T.book_name(bn, lang='la'), chapter, verse))+ln_tweak
        lnx = '''"=HYPERLINK(""{}""; ""link"")"'''.format(ln)
        vl = F.lex.v(wn).rstrip('[=')
        vstem = F.vs.v(wn)
        vt = T.words([wn], fmt='ec').replace('\n', '')
        ct = T.words(L.d('word', cn), fmt='ec').replace('\n', '')

        # clause row: phrase node is -1, phrase columns are left empty
        common_fields = (cn, wn, -1, book, chapter, verse, lnx, vl, vstem, vt)
        clause_fields = (ct,)
        rows.append(common_fields + clause_fields + (('',)*cfillrows))
        for pn in L.d('phrase', cn):
            # same ten common fields as the clause row, so columns line up with field_names
            common_fields = (cn, wn, pn, book, chapter, verse, lnx, vl, vstem, vt)
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            # a correction, if present, overrides the original function value
            pf = pf_corr.get(pn, None) or F.function.v(pn)
            phrase_fields = (pt, pf) + transform[pf]
            rows.append(common_fields + phrase_fields + (('',)*pfillrows))
    filename = vfile(verb, 'enrich')
    row_file = outfile(filename)
    row_file.write('{}\n'.format(fieldsep.join(field_names)))
    for row in rows:
        row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    row_file.close()
    msg('Generated enrichment sheet for verb {}'.format(filename))
    
for verb in verbs: gen_sheet_enrich(verb)    

56m 23s Generated enrichment sheet for verb QRa_enrich_etcbc4b.csv
56m 23s Generated enrichment sheet for verb CJT_enrich_etcbc4b.csv
56m 23s Generated enrichment sheet for verb BRa_enrich_etcbc4b.csv
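The pre-filled values in an enrichment sheet follow from two lookups: the corrected function if there is one, otherwise the original, and then the defaults for that function. A minimal sketch with hypothetical phrase nodes; `original`, `corrected`, `defaults` and `enrichment_row` are illustrative names, and the default tuples are taken from the table above.

```python
# Hypothetical original function values and corrections, keyed by phrase node.
original = {600: 'Time', 601: 'Pred', 602: 'Cmpl'}
corrected = {602: 'Loca'}

# Defaults per function: (valence, grammatical, lexical, semantic).
defaults = {
    'Time': ('adjunct', 'NA', 'time', 'time'),
    'Pred': ('core', 'predication', '', ''),
    'Loca': ('adjunct', 'NA', 'location', 'location'),
    'Cmpl': ('complement', '*', '', ''),
}

def enrichment_row(node):
    # A correction, when present, takes precedence over the original value.
    function = corrected.get(node) or original[node]
    return (node, function) + defaults[function]

for node in sorted(original):
    print(enrichment_row(node))
```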


# 3.1 Process the enrichments

We read the enrichments, perform some consistency checks, and produce an annotation package.

In [13]:
pf_enriched = set()
repeated = collections.defaultdict(list)
non_phrase = set()

def read_enrich(rootdir):
    results = []
    for verb in sorted(verbs):
        filename = '{}/{}'.format(rootdir, vfile(verb, 'enriched'))
        if not os.path.exists(filename):
            print('NO file {}'.format(filename))
            continue
        with open(filename) as f:
            header = f.__next__()
            for line in f:
                fields = line.rstrip().split(';')
                pn = int(fields[2])
                if pn < 0: continue
                vvals = tuple(fields[-4:])
                results.append((pn,)+vvals)
                if pn in pf_enriched:
                    repeated[pn] += vvals
                else:
                    pf_enriched.add(pn)
                if F.otype.v(pn) != 'phrase': 
                    non_phrase.add(pn)

        print('{}: Found {:>5} enrichments in {}'.format(verb, len(results), filename))
    if len(repeated):
        msg('ERROR: Some phrases have been enriched multiple times!')
        for x in sorted(repeated):
            print('{:>6}: {}'.format(x, ', '.join(repeated[x])))
    else:
        msg('OK: Enriched phrases did not receive multiple enrichments')
    if len(non_phrase):
        msg('ERROR: Enrichments have been applied to non-phrase nodes: {}'.format(','.join(str(n) for n in sorted(non_phrase))))
    else:
        msg('OK: all enriched nodes where phrase nodes')
    print(results[0:10])
    return results

corr = ExtraData(API)
corr.deliver_annots(
    'complements', 
    {'title': 'Verb complement enrichments', 'date': '2016-03'},
    [
        ('cpl', 'complements', read_enrich, tuple(
            ('JanetDyk', 'ft', fname) for fname in specfields
        ))
    ],
)

    30s OK: Enriched phrases did not receive multiple enrichments
    30s OK: all enriched nodes where phrase nodes


NO file /Users/dirk/surfdrive/laf-fabric-data/cpl/BRa_enriched_etcbc4b.csv
CJT: Found   291 enrichments in /Users/dirk/surfdrive/laf-fabric-data/cpl/CJT_enriched_etcbc4b.csv
NO file /Users/dirk/surfdrive/laf-fabric-data/cpl/QRa_enriched_etcbc4b.csv
[(605977, 'NA', 'NA', 'NA', 'NA'), (605978, 'complement', 'object', '', ''), (605979, 'core', 'predication', '', ''), (605980, 'complement', '*', '', ''), (606391, 'NA', 'NA', 'NA', 'NA'), (606392, 'core', 'predication', '', ''), (606393, 'adjunct', 'NA', '', ''), (606394, 'core', 'subject', '', ''), (606395, 'complement', 'object', '', ''), (606396, 'complement', '*', '', '')]
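The row parsing inside `read_enrich` can be exercised on a tiny hand-made sheet. This sketch assumes the 16-column layout of the enrichment sheets (third column is the phrase node, `-1` on clause rows; the last four columns are the feature values); the function name `parse_enriched` and the sample rows are hypothetical.

```python
def parse_enriched(lines, sep=';'):
    """Return (results, repeated): feature tuples per phrase, and nodes seen twice."""
    results, seen, repeated = [], set(), set()
    for line in lines[1:]:               # skip the header line
        fields = line.rstrip('\n').split(sep)
        node = int(fields[2])            # phrase node, -1 for clause rows
        if node < 0:
            continue
        if node in seen:
            repeated.add(node)
        seen.add(node)
        results.append((node,) + tuple(fields[-4:]))
    return results, repeated

# Hypothetical sheet: header, one clause row, one phrase row.
sheet = [
    'cnode#;vnode#;pnode#;book;chapter;verse;link;verb_lexeme;verb_stem;'
    'verb_occurrence;phrase_text;function;valence;grammatical;lexical;semantic',
    '500;10;-1;Genesis;1;1;link;BR>;qal;BR>;B R>CJT BR>;;;;;',
    '500;10;601;Genesis;1;1;link;BR>;qal;BR>;BR>;Pred;core;predication;;',
]
results, repeated = parse_enriched(sheet)
print(results)
print(repeated)
```

Because only the third column and the last four are read, the parser does not depend on the exact number of columns in between.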


# 3.2 Annox complements
We load the new annotations into the LAF-Fabric API, in the process of which they will be compiled.

Note that we draw in the new annotations by specifying an *annox* called `complements` (the second argument of the `fabric.load` function).

The enrichments have been turned into LAF annotations: each is stored in new features,
with the names given above in the variable `specfields`,
with label `ft` and namespace `JanetDyk`.
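To picture what "one feature per enrichment column" means, the sketch below regroups a few result rows into one mapping per feature, keyed by namespace, label and name. It is a plain-Python illustration only and does not reflect how LAF-Fabric stores annotations internally; the two rows are copied from the output of `read_enrich` above.

```python
namespace, label = 'JanetDyk', 'ft'
feature_names = ['valence', 'grammatical', 'lexical', 'semantic']

# Two rows as returned by read_enrich: (phrase node, then one value per feature).
rows = [(605978, 'complement', 'object', '', ''),
        (605979, 'core', 'predication', '', '')]

# One node-to-value mapping per (namespace, label, name) triple.
features = {(namespace, label, name): {} for name in feature_names}
for node, *values in rows:
    for name, value in zip(feature_names, values):
        features[(namespace, label, name)][node] = value

print(features[('JanetDyk', 'ft', 'valence')])
```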

In [14]:
API=fabric.load(source+version, 'complements', 'flow_corr', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        oid otype
        sp vs lex
        function
        chapter verse
    ''' + ' '.join(specfields),
    '''
    '''),
    "prepare": prepare,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: UP TO DATE
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s BEGIN COMPILE a: complements
  0.00s DETAIL: load main: X. [node]  -> 
  1.44s DETAIL: load main: X. [e]  -> 
  4.00s DETAIL: load main: G.node_anchor_min
  4.11s DETAIL: load main: G.node_anchor_max
  4.22s DETAIL: load main: G.node_sort
  4.33s DETAIL: load main: G.node_sort_inv
  4.90s DETAIL: load main: G.edges_from
  5.03s DETAIL: load main: G.edges_to
  5.17s LOGFILE=/Users/dirk/surfdrive/laf-fabric-data/etcbc4b/bin/A/complements/__log__compile__.txt
  5.17s PARSING ANNOTATION FILES
  5.21s INFO: parsing complements.xml
  5.23s INFO: END PARSING
         0 good   regions  and     0 faulty ones
         0 linked nodes    and     0 unlinked ones
         0 good   edges    and     0 faulty ones
       291 good   annots   and     0 faulty ones
      1164 good   features and     0 faulty ones
       291 distinct xml identifiers

  5.23s MODELI