<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="right" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="right" src="images/etcbc4easy-small.png"/></a>

# Complement corrections


# 0. Introduction

Joint work of Dirk Roorda and Janet Dyk.

In order to do
[flowchart analysis](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/flowchart.html)
on verbs, we need to correct some coding errors.

Because the flowchart assigns meanings to verbs depending on the number and nature of complements found in their context, it is important that the phrases in those clauses are labeled correctly, i.e. that the
[function](https://shebanq.ancient-data.org/shebanq/static/docs/featuredoc/features/comments/function.html)
feature for those phrases has the correct label.

# 1. Task
In this notebook we do the following tasks:

* generate correction sheets for selected verbs,
* transform the set of filled-in correction sheets into an annotation package.

Between the first and second task, the sheets will have been filled in by Janet with corrections.

The resulting annotation package offers the corrections as the value of a new feature, also called `function`, but now in the annotation space `JanetDyk` instead of `etcbc4`.

# 1. Implementation

Start the engines, and note the import of the `ExtraData` functionality from the `etcbc.extra` module.
This module can turn data with anchors into additional LAF annotations to the big ETCBC LAF resource.

In [1]:
import sys,os
import collections

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.extra import ExtraData

fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.22
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [2]:
source = 'etcbc'
version = '4b'

We instruct the API to load data.
Note that we ask for the XML identifiers, because `ExtraData` needs them to stitch the corrections into the LAF XML.

In [3]:
API = fabric.load(source+version, '--', 'flow_corr', {
    "xmlids": {"node": True, "edge": False},
    "features": ('''
        oid otype
        sp vs lex
        function
        chapter verse
    ''',''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  6.41s LOGFILE=/Users/dirk/Local/laf-fabric-output/etcbc4b/flow_corr/__log__flow_corr.txt
  6.41s INFO: LOADING PREPARED data: please wait ... 
  6.41s prep prep: G.node_sort
  6.52s prep prep: G.node_sort_inv
  7.00s prep prep: L.node_up
    10s prep prep: L.node_down
    15s prep prep: V.verses
    15s prep prep: V.books_la
    15s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
    17s INFO: LOADED PREPARED data
    17s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK flow_corr AT 2016-03-18T08-20-45


# 1.1 Domain
Here is the set of verbs that interest us.

In [4]:
verbs = set('''
    CJT
    BR>
    QR>
'''.strip().split())

We generate a list of occurrences of those verbs, organized by the lexeme of the verb.

In [5]:
msg('Finding occurrences')
occs = collections.defaultdict(list)
for n in F.otype.s('word'):
    lex = F.lex.v(n)
    if lex.endswith('['):
        lex = lex[0:-1]
        occs[lex].append(n)
msg('Done')
for verb in sorted(verbs):
    print('{} {:<5} occurrences'.format(verb, len(occs[verb])))

    24s Finding occurrences
    25s Done


BR> 48    occurrences
CJT 85    occurrences
QR> 743   occurrences
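
The grouping step can be tried without loading the ETCBC data. The following is a minimal sketch, assuming a hypothetical list of `(node, lexeme)` pairs in place of `F.lex.v(n)`; the node numbers and the helper name `group_verbs` are made up for illustration. As in the cell above, lexemes ending in `[` are treated as verbs.

```python
import collections

# Hypothetical (node, lexeme) pairs standing in for F.lex.v(n) on word nodes.
toy_words = [
    (101, 'BR>['),
    (102, 'JWM/'),
    (103, 'QR>['),
    (104, 'BR>['),
    (105, 'CJT['),
]

def group_verbs(words):
    """Map each lexeme ending in '[' (minus that bracket) to its word nodes."""
    occs = collections.defaultdict(list)
    for node, lex in words:
        if lex.endswith('['):
            occs[lex[:-1]].append(node)
    return occs

toy_occs = group_verbs(toy_words)
for verb in sorted(toy_occs):
    print('{} {:<5} occurrences'.format(verb, len(toy_occs[verb])))
```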


# 1.2 Blank sheet generation
Generate correction sheets.
They are CSV files. Every row corresponds to either a verb occurrence or a phrase.

Every row contains several identification fields:

* node of the verb
* node of the clause of the verb

In phrase rows, the node of the phrase is added as well; verb rows carry `-1` in that position.
Every row also contains book, chapter, and verse.

Since all rows carry the verb and clause node identifiers, the rows can be sorted conveniently.

The remaining fields depend on the type of row.

* verb rows:
  * verb lexeme
  * verb occurrence
  * clause text
* phrase rows:
  * phrase node
  * phrase text
  * phrase function
  * phrase function (corrected)
  * phrase verbal valence
  * phrase lexical characterisation
  * phrase grammatical relation
  * phrase semantic role
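
The sorting remark can be illustrated with a small sketch. It assumes rows that start with the clause node, verb node and phrase node, in that order, with `-1` as the phrase node of a verb row, as in the generation code in this notebook. All node numbers are made up.

```python
# Toy rows in the order (cnode#, vnode#, pnode#, label); all numbers are made up.
rows = [
    (500, 20, 610, 'phrase row'),
    (400, 10, 605, 'phrase row'),
    (500, 20, -1, 'verb row'),
    (400, 10, 601, 'phrase row'),
    (400, 10, -1, 'verb row'),
]

# Sorting on the three node columns groups rows per clause and verb;
# the -1 placeholder puts each verb row ahead of its phrase rows.
sorted_rows = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
for row in sorted_rows:
    print(row)
```

After sorting, every verb row is immediately followed by the phrase rows of its clause, even if the sheet was shuffled while being edited.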

In [16]:
COMMON_FIELDS = '''
    cnode#
    vnode#
    pnode#
    book
    chapter
    verse
    verb_lexeme
    verb_occurrence
'''.strip().split()

CLAUSE_FIELDS = '''
    clause_text    
'''.strip().split()

PHRASE_FIELDS = '''
    phrase_text
    function
    function_(corr)
    valence
    lexical
    grammatical
    semantical
'''.strip().split()

field_names = []
for f in COMMON_FIELDS: field_names.append(f)
for i in range(max((len(CLAUSE_FIELDS), len(PHRASE_FIELDS)))):
    pf = PHRASE_FIELDS[i] if i < len(PHRASE_FIELDS) else '--'
    field_names.append(pf)
    
fillrows = len(CLAUSE_FIELDS) - len(PHRASE_FIELDS)
cfillrows = 0 if fillrows >= 0 else -fillrows
pfillrows = fillrows if fillrows >= 0 else 0
print('\n'.join(field_names))    

cnode#
vnode#
pnode#
book
chapter
verse
verb_lexeme
verb_occurrence
phrase_text
function
function_(corr)
valence
lexical
grammatical
semantical


In [18]:
def vfile(verb, kind): return '{}_{}_{}{}.csv'.format(verb.replace('>','a').replace('<', 'o'), kind, source, version)

def gen_sheet(verb):
    rows = []
    fieldsep = ';'
    for wn in occs[verb]:
        cn = L.u('clause', wn)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        book = T.book_name(bn, lang='en')
        chapter = F.chapter.v(vn)
        verse = F.verse.v(vn)
        vl = F.lex.v(wn).rstrip('[=')
        vt = T.words([wn], fmt='ec').replace('\n', '')
        ct = T.words(L.d('word', cn), fmt='ec').replace('\n', '')
        
        common_fields = (cn, wn, -1, book, chapter, verse, vl, vt)
        clause_fields = (ct,)
        rows.append(common_fields + clause_fields + (('',)*cfillrows))
        for pn in L.d('phrase', cn):
            common_fields = (cn, wn, pn, book, chapter, verse, vl, vt)
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            pf = F.function.v(pn)
            phrase_fields = (pt, pf) + (('',)*5)
            rows.append(common_fields + phrase_fields + (('',)*pfillrows))
    filename = vfile(verb, 'blank')
    row_file = outfile(filename)
    row_file.write('{}\n'.format(fieldsep.join(field_names)))
    for row in rows:
        row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    row_file.close()
    msg('Generated correction sheet for verb {}'.format(filename))
    
for verb in verbs: gen_sheet(verb)    

41m 23s Generated correction sheet for verb BRa_blank_etcbc4b.csv
41m 23s Generated correction sheet for verb CJT_blank_etcbc4b.csv
41m 23s Generated correction sheet for verb QRa_blank_etcbc4b.csv
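
The file naming and row writing can be sketched in isolation. The helper `sheet_name` mirrors `vfile` above but takes `source` and `version` as parameters; the use of `csv.writer` is an alternative to the manual `join` in `gen_sheet`, not what the notebook does. It may be worth considering because it quotes any field that happens to contain the separator.

```python
import csv
import io

def sheet_name(verb, kind, source='etcbc', version='4b'):
    """Build a file name, replacing '>' and '<' by letters that are safe in paths."""
    safe = verb.replace('>', 'a').replace('<', 'o')
    return '{}_{}_{}{}.csv'.format(safe, kind, source, version)

def write_rows(handle, header, rows, fieldsep=';'):
    """Write a header and rows; fields containing the separator get quoted."""
    writer = csv.writer(handle, delimiter=fieldsep, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)

# In-memory buffer instead of a real output file; the row content is made up.
buffer = io.StringIO()
write_rows(buffer, ['pnode#', 'phrase_text'], [[601, 'text; with separator']])
print(sheet_name('BR>', 'blank'))
print(buffer.getvalue())
```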


# 1.3 Processing corrections
We read the filled-in correction sheets and extract the correction data from them.
Then we turn that data into LAF annotations. Every correction is stored in a new feature, with name `function`, label `ft` and namespace `JanetDyk`.

In [91]:
def read_corr(rootdir):
    results = []
    for verb in verbs:
        filename = '{}/{}'.format(rootdir, vfile(verb, 'corrected'))
        if not os.path.exists(filename):
            print('NO file {}'.format(filename))
            continue
        n_before = len(results)  # so that we report the count for this file only
        with open(filename) as f:
            next(f)  # skip the header row
            for line in f:
                fields = line.rstrip().split(';')
                # every group of four fields after the first one holds
                # a phrase node (offset 0) and its corrected function (offset 3)
                for i in range(1, len(fields)//4):
                    (pn, pc) = (fields[4*i], fields[4*i+3])
                    pc = pc.strip()
                    if pc != '': results.append((int(pn), pc))
        print('{}: Found {:>5} corrections in {}'.format(verb, len(results) - n_before, filename))
    return results

corr = ExtraData(API)
corr.deliver_annots(
    'complements', 
    {'title': 'Verb complement corrections', 'date': '2016-02'},
    [
        ('cpl', 'complements', read_corr, (
            ('JanetDyk', 'ft', 'function'),
        ))
    ],
)

CJT: Found     6 corrections in /Users/dirk/laf-fabric-data/cpl/CJT_corrected_etcbc4b.csv
NO file /Users/dirk/laf-fabric-data/cpl/QRa_corrected_etcbc4b.csv
NO file /Users/dirk/laf-fabric-data/cpl/BRa_corrected_etcbc4b.csv
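
As an alternative to positional indexing, corrections could be looked up by column name. The sketch below is not the notebook's method. It assumes that a filled-in sheet keeps the column names of the blank sheet (`pnode#` and `function_(corr)`), whereas `read_corr` above reads fields in groups of four and may expect a different layout. The function name `read_corrections` and the sample rows are made up, and the header is trimmed to a few columns for brevity.

```python
import csv
import io

def read_corrections(lines, fieldsep=';'):
    """Return (phrase node, corrected function) pairs from a filled-in sheet."""
    results = []
    for row in csv.DictReader(lines, delimiter=fieldsep):
        pnode = int(row['pnode#'])
        corr = (row['function_(corr)'] or '').strip()
        # verb rows carry -1 as phrase node; empty cells mean "no correction"
        if pnode != -1 and corr:
            results.append((pnode, corr))
    return results

# Made-up sheet excerpt with a trimmed header.
toy_sheet = io.StringIO(
    'cnode#;vnode#;pnode#;function;function_(corr)\n'
    '400;10;-1;;\n'
    '400;10;601;Cmpl;Loca\n'
    '400;10;605;Pred;\n'
    '500;20;610;Adju; Benf \n'
)
toy_corrections = read_corrections(toy_sheet)
print(toy_corrections)
```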


# 2 Check the corrections

We load the corrections into the LAF-Fabric API, in the process of which they will be compiled.
We perform a few basic consistency checks.

# 2.1 Annox complements
Load the API again, but note that we draw in the new annotations by specifying an *annox* called `complements` (the second argument of the `fabric.load` function).

In [7]:
API=fabric.load(source+version, 'complements', 'flow_corr', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        oid otype
        JanetDyk:ft.function etcbc4:ft.function
    ''',
    '''
    '''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s USING annox DATA COMPILED AT: 2016-03-11T07-47-07
  1.04s INFO: Feature function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  1.04s INFO: Feature ft_function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  1.04s INFO: LOADING PREPARED data: please wait ... 
  1.04s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
  1.68s INFO: Feature function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  1.68s INFO: Feature ft_function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  3.11s INFO: Feature function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  3.11s INFO: Feature ft_function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  3.11s INFO: LOADED PREPARED data
  3.11s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK flow_corr AT 2016-03-14T14-21-18


# 2.2 Checks

We make sure that every correction applies to a node that corresponds to a phrase, and that no phrase receives more than one correction.

In [8]:
msg('Checking corrections')
corr = collections.Counter()
errors = collections.Counter()
corrected_nodes = set()
for n in NN():
    c = F.JanetDyk_ft_function.v(n)
    if c is not None:
        if F.otype.v(n) != 'phrase': errors['Correction applied to non-phrase object'] += 1
        if n in corrected_nodes: errors['Phrase with multiple corrections'] += 1
        o = F.etcbc4_ft_function.v(n) or ''
        corr[(o, c)] += 1
        corrected_nodes.add(n)
    
msg('Found {} types of corrections'.format(len(corr)))
print(corr)
for ((o, c), n) in sorted(corr.items(), key=lambda x: (-x[1], x[0])):
    print('{:<5} => {:<5} {:>5} x'.format(o, c, n))
if not errors:
    print('NO ERRORS DETECTED')
else:
    print('THERE ARE ERRORS:')
    for (e, n) in sorted(errors.items(), key=lambda x: (-x[1], x[0])):
        print('{:>5} x {}'.format(n, e))

    16s Checking corrections
    17s Found 3 types of corrections


Counter({('Cmpl', 'Loca'): 3, ('Adju', 'Benf'): 2, ('Supp', 'Adju-Benf'): 1})
Cmpl  => Loca      3 x
Adju  => Benf      2 x
Supp  => Adju-Benf     1 x
NO ERRORS DETECTED
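
The same two checks can also be run on the raw list of `(node, correction)` pairs before they are compiled into annotations. This may be a useful complement if the loop over nodes visits each node only once, in which case it would not see a duplicate. The sketch below is a hypothetical helper, not part of the notebook; `otype_of` is a made-up dictionary standing in for `F.otype.v`, and all node numbers are invented.

```python
import collections

def check_corrections(pairs, otype_of):
    """Count corrections on non-phrase nodes and nodes corrected more than once."""
    errors = collections.Counter()
    seen = collections.Counter(node for node, _ in pairs)
    for node, count in seen.items():
        if otype_of.get(node) != 'phrase':
            errors['Correction applied to non-phrase object'] += 1
        if count > 1:
            errors['Phrase with multiple corrections'] += 1
    return errors

# Made-up input: node 610 is corrected twice, node 10 is a word.
toy_pairs = [(601, 'Loca'), (610, 'Benf'), (610, 'Adju'), (10, 'Loca')]
toy_otypes = {601: 'phrase', 610: 'phrase', 10: 'word'}

toy_errors = check_corrections(toy_pairs, toy_otypes)
if not toy_errors:
    print('NO ERRORS DETECTED')
else:
    for (e, n) in sorted(toy_errors.items(), key=lambda x: (-x[1], x[0])):
        print('{:>5} x {}'.format(n, e))
```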


# Additional observations

## Phrases with function ``EPPr``

There are 9 occurrences, 8 of which are in Aramaic text.
In all cases the phrase functions as a copula.

In [20]:
for p in F.otype.s('phrase'):
    if F.function.v(p) != 'EPPr': continue
    words = L.d('word', p)
    first_word = words[0]
    b = L.u('book', first_word)
    v = L.u('verse', first_word)
    c = L.u('clause', first_word)
    passage = '{} {}:{}'.format(
        T.book_name(b, lang='en'), 
        F.chapter.v(v),
        F.verse.v(v),
    )
    pt = T.words(L.d('word', p), fmt='ec').replace('\n', '')
    ct = T.words(L.d('word', c), fmt='ec').replace('\n', '')
    print('{} {} :: {}'.format(passage, pt, ct))

Daniel 2:9 HJ>  :: XDH&HJ> DTKWN 
Daniel 2:38 HW>  :: >NT&HW> R>#H DJ DHB>00
Daniel 2:47 HW>  :: DJ >LHKWN HW> >LH >LHJN WMR> MLKJN 
Daniel 3:15 HW>  :: WMN&HW> >LH 
Daniel 4:27 HJ>  :: HL> D>&HJ> BBL RBT> 
Daniel 5:13 HW>  :: >NT&HW> DNJ>L 
Daniel 6:17 HW>  :: >LHK HW> J#JZBNK00
Ezra 5:11 HMW  :: >NXN> HMW <BDWHJ DJ&>LH #MJ> W>R<> 
2_Chronicles 20:6 HW>  :: HL> >TH&HW> >LHJM B#MJM 
