<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="right" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="right" src="images/etcbc4easy-small.png"/></a>

# Complement corrections


# 0. Introduction

Joint work of Dirk Roorda and Janet Dyk.

In order to do
[flowchart analysis](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/flowchart.html)
on verbs, we need to correct some coding errors.

Because the flowchart assigns meanings to verbs depending on the number and nature of complements found in their context, it is important that the phrases in those clauses are labeled correctly, i.e. that the
[function](https://shebanq.ancient-data.org/shebanq/static/docs/featuredoc/features/comments/function.html)
feature for those phrases has the correct label.

# 1. Task
In this notebook we do the following tasks:

* generate correction sheets for selected verbs,
* transform the set of filled-in correction sheets into an annotation package.

Between the first and second task, the sheets will have been filled in by Janet with corrections.

The resulting annotation package offers the corrections as the value of a new feature, also called `function`, but now in the annotation space `JanetDyk` instead of `etcbc4`.

# 1. Implementation

Start the engines, and note the import of the `ExtraData` functionality from the `etcbc.extra` module.
This module can turn data with anchors into additional LAF annotations to the big ETCBC LAF resource.

In [6]:
import sys,os
import collections

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.extra import ExtraData

fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.21
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html

25m 49s END


In [7]:
source = 'etcbc'
version = '4b'

We instruct the API to load data.
Note that we ask for the XML identifiers, because `ExtraData` needs them to stitch the corrections into the LAF XML.

In [67]:
API = fabric.load(source+version, '--', 'flow_corr', {
    "xmlids": {"node": True, "edge": False},
    "features": ('''
        oid otype
        sp vs lex
        function
        chapter verse
    ''',''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  4.23s INFO: Feature ft_function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  4.23s INFO: Feature function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  4.24s INFO: LOADING PREPARED data: please wait ... 
  4.24s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
  4.79s INFO: Feature ft_function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  4.79s INFO: Feature function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  6.01s INFO: Feature ft_function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  6.01s INFO: Feature function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  6.01s INFO: LOADED PREPARED data
  6.01s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK flow_corr AT 2016-03-11T07-02-00


# 1.1 Domain
Here is the set of verbs that interest us.

In [68]:
verbs = set('''
    CJT
    BR>
    QR>
'''.strip().split())

We generate a list of occurrences of those verbs, organized by the lexeme of the verb.

In [70]:
msg('Finding occurrences')
occs = collections.defaultdict(list)
for n in F.otype.s('word'):
    lex = F.lex.v(n)
    if lex.endswith('['):
        lex = lex[0:-1]
        occs[lex].append(n)
msg('Done')
for verb in sorted(verbs):
    print('{} {:<5} occurrences'.format(verb, len(occs[verb])))

 1m 40s Finding occurrences
 1m 42s Done


BR> 48    occurrences
CJT 85    occurrences
QR> 743   occurrences
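
The grouping step above can be tried without loading the ETCBC data. The following is a minimal sketch on made-up input: the `(node, lexeme)` pairs and the helper name `group_verb_occurrences` are assumptions for illustration. It only mirrors the rule used in the cell above, namely that a lexeme ending in `[` is treated as a verb and is stored without that marker.

```python
import collections

# Made-up (node, lexeme) pairs standing in for F.otype.s('word') and F.lex.v(n).
sample_words = [
    (1, 'B'),
    (2, 'R>CJT/'),
    (3, 'BR>['),
    (4, '>LHJM/'),
    (5, 'QR>['),
    (6, 'BR>['),
]

def group_verb_occurrences(words):
    """Collect nodes per verb lexeme; the trailing '[' marker is stripped."""
    occs = collections.defaultdict(list)
    for node, lex in words:
        if lex.endswith('['):
            occs[lex[:-1]].append(node)
    return occs

sample_occs = group_verb_occurrences(sample_words)
for verb in sorted(sample_occs):
    print('{} {:<5} occurrences'.format(verb, len(sample_occs[verb])))
```

Note that, as in the notebook, occurrences are collected for every verb, not only for those in `verbs`; the selection happens when the result is looked up by lexeme.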


# 1.2 Blank sheet generation
Generate correction sheets.
They are CSV files. Every row corresponds to a verb occurrence.
The fields per row are the node number of the clause in which the verb occurs, the node number of the verb occurrence, a passage label (book, chapter, verse), the text of the verb occurrence (in ETCBC transliteration, consonantal), and then 4 columns for each phrase in the clause:

* phrase node number
* phrase text (ETCBC translit consonantal)
* original value of the `function` feature
* corrected value of the `function` feature (generated as empty)

In [71]:
def vfile(verb, kind): return '{}_{}_{}{}.csv'.format(verb.replace('>','a').replace('<', 'o'), kind, source, version)

def gen_sheet(verb):
    rows = []
    fieldsep = ';'
    field_names = '''
        clause#
        word#
        passage
        verb    
    '''.strip().split()
    max_phrases = 0
    for wn in occs[verb]:
        cln = L.u('clause', wn)
        vn = L.u('verse', wn)
        bn = L.u('book', wn)
        passage_label = '{} {}:{}'.format(T.book_name(bn, lang='en'), F.chapter.v(vn), F.verse.v(vn))
        vt = T.words([wn], fmt='ec').replace('\n', '')
        row = [cln, wn, passage_label, vt]
        phrases = L.d('phrase', cln)
        n_phrases = len(phrases)
        if n_phrases > max_phrases: max_phrases = n_phrases
        for pn in phrases:
            pt = T.words(L.d('word', pn), fmt='ec').replace('\n', '')
            pf = F.function.v(pn)
            row.extend((pn, pt, pf, ''))
        rows.append(row)
    for i in range(max_phrases):
        field_names.extend('''
            phr{i}#
            phr{i}_txt
            phr{i}_function
            phr{i}_corr
        '''.format(i=i+1).strip().split())
    filename = vfile(verb, 'blank')
    row_file = outfile(filename)
    row_file.write('{}\n'.format(fieldsep.join(field_names)))
    for row in rows:
        row_file.write('{}\n'.format(fieldsep.join(str(x) for x in row)))
    row_file.close()
    msg('Generated correction sheet for verb {}'.format(filename))
    
for verb in verbs: gen_sheet(verb)    

 8m 21s Generated correction sheet for verb CJT_blank_etcbc4b.csv
 8m 21s Generated correction sheet for verb QRa_blank_etcbc4b.csv
 8m 21s Generated correction sheet for verb BRa_blank_etcbc4b.csv
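
To make the sheet layout concrete, here is a small stand-alone sketch of the two naming conventions used by `gen_sheet`: the file name and the header row. The function names `sheet_filename` and `sheet_header` are illustrative, not part of the notebook, and the default `source` and `version` values are simply the ones set earlier.

```python
def sheet_filename(verb, kind, source='etcbc', version='4b'):
    """File name for a sheet; '>' and '<' are replaced because they are awkward in file names."""
    safe_verb = verb.replace('>', 'a').replace('<', 'o')
    return '{}_{}_{}{}.csv'.format(safe_verb, kind, source, version)

def sheet_header(max_phrases, fieldsep=';'):
    """Header row: 4 fixed columns, then 4 columns per phrase slot."""
    field_names = ['clause#', 'word#', 'passage', 'verb']
    for i in range(1, max_phrases + 1):
        field_names.extend([
            'phr{}#'.format(i),
            'phr{}_txt'.format(i),
            'phr{}_function'.format(i),
            'phr{}_corr'.format(i),
        ])
    return fieldsep.join(field_names)

print(sheet_filename('BR>', 'blank'))
print(sheet_header(2))
```

Rows for clauses with fewer phrases than the maximum are simply shorter than the header, so a reader of these sheets should not assume a fixed number of fields per row.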


# 1.3 Processing corrections
We read the filled-in correction sheets and extract the correction data from them.

In [50]:
def read_corr(rootdir):
    results = []
    for verb in verbs:
        filename = '{}/{}'.format(rootdir, vfile(verb, 'corrected'))
        if not os.path.exists(filename):
            print('NO file {}'.format(filename))
            continue
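        n_before = len(results)
        with open(filename) as f:
            next(f)  # skip the header row
            for line in f:
                fields = line.rstrip().split(';')
                # fields 0-3 describe the verb occurrence; each following group of 4 describes one phrase
                for i in range(1, len(fields)//4):
                    (pn, pc) = (fields[4*i], fields[4*i+3])
                    pc = pc.strip()
                    if pc != '': results.append((int(pn), pc))
        # report the count for this file only, not the running total
        print('{}: Found {:>5} corrections in {}'.format(verb, len(results) - n_before, filename))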
    return results
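
The row parsing can be checked in isolation. The sketch below applies the same field arithmetic to an in-memory sheet; the helper name `corrections_from_lines` and all node numbers, passage labels and phrase texts are invented for illustration and do not come from the ETCBC data.

```python
import io

def corrections_from_lines(lines, fieldsep=';'):
    """Return (phrase node, corrected function) pairs from an iterator of sheet lines."""
    results = []
    next(lines)  # skip the header row
    for line in lines:
        fields = line.rstrip('\n').split(fieldsep)
        # Fields 0-3 describe the verb occurrence; phrase groups start at index 4.
        for start in range(4, len(fields) - 3, 4):
            pn = fields[start]
            pc = fields[start + 3].strip()
            if pc != '':
                results.append((int(pn), pc))
    return results

# Invented sample sheet: three rows, the second one with a single phrase.
sample_sheet = io.StringIO(
    'clause#;word#;passage;verb;phr1#;phr1_txt;phr1_function;phr1_corr;'
    'phr2#;phr2_txt;phr2_function;phr2_corr\n'
    '500;10;Book 1:1;BR>;600;txt;Time;;601;txt;Pred;\n'
    '501;20;Book 1:5;QR>;610;txt;Cmpl;Loca\n'
    '502;30;Book 2:3;CJT;620;txt;Adju;Benf;621;txt;Supp;Adju-Benf\n'
)

sample_corrections = corrections_from_lines(sample_sheet)
print(sample_corrections)
```

Only phrases with a non-empty correction column produce a pair, so an untouched sheet yields an empty list.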

Then we turn that data into LAF annotations. Every correction is stored in a new feature, with name `function`, label `ft` and namespace `JanetDyk`.

In [72]:
corr = ExtraData(API)
corr.deliver_annots(
    'complements', 
    {'title': 'Verb complement corrections', 'date': '2016-02'},
    [
        ('cpl', 'complements', read_corr, (
            ('JanetDyk', 'ft', 'function'),
        ))
    ],
)

CJT: Found     6 corrections in /Users/dirk/laf-fabric-data/cpl/CJT_corrected_etcbc4b.csv
NO file /Users/dirk/laf-fabric-data/cpl/QRa_corrected_etcbc4b.csv
NO file /Users/dirk/laf-fabric-data/cpl/BRa_corrected_etcbc4b.csv


# 2. Check the corrections

We load the corrections into the LAF-Fabric API, in the process of which they will be compiled.
We perform a few basic consistency checks.

# 2.1 Annox complements
Load the API again, but note that we draw in the new annotations by specifying an *annox* called `complements` (the second argument of the `fabric.load` function).

In [59]:
API=fabric.load(source+version, 'complements', 'flow_corr', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        oid otype
        JanetDyk:ft.function etcbc4:ft.function
    ''',
    '''
    '''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s USING annox DATA COMPILED AT: 2016-03-11T06-31-06
  0.99s INFO: Feature ft_function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  1.00s INFO: Feature function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  1.00s INFO: LOADING PREPARED data: please wait ... 
  1.00s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
  1.56s INFO: Feature ft_function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  1.56s INFO: Feature function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  2.94s INFO: Feature ft_function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  2.94s INFO: Feature function refers to etcbc4_ft_function, not to JanetDyk_ft_function
  2.95s INFO: LOADED PREPARED data
  2.95s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK flow_corr AT 2016-03-11T06-34-18


# 2.2 Checks

We make sure that every correction applies to a node that corresponds to a phrase, and that no two corrections apply to the same phrase.

In [73]:
msg('Checking corrections')
corr = collections.Counter()
errors = collections.Counter()
corrected_nodes = set()
for n in NN():
    c = F.JanetDyk_ft_function.v(n)
    o = F.etcbc4_ft_function.v(n)
    if c is None: continue
    corr[(o, c)] += 1
    if F.otype.v(n) != 'phrase': errors['Correction applied to non-phrase object'] += 1
    elif n in corrected_nodes: errors['Phrase with multiple corrections'] += 1
    corrected_nodes.add(n)
    
msg('Found {} types of corrections'.format(len(corr)))
for ((o, c), n) in sorted(corr.items(), key=lambda x: (-x[1], x[0])):
    print('{:<5} => {:<5} {:>5} x'.format(o, c, n))
if not errors:
    print('NO ERRORS DETECTED')
else:
    print('THERE ARE ERRORS:')
    for (e, n) in sorted(errors.items(), key=lambda x: (-x[1], x[0])):
        print('{:>5} x: {}'.format(n, e))

21m 35s Checking corrections
21m 38s Found 3 types of corrections


Cmpl  => Loca      3 x
Adju  => Benf      2 x
Supp  => Adju-Benf     1 x
NO ERRORS DETECTED
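
The same two checks can also be run on the raw `(node, correction)` pairs before they are delivered as annotations. This is a sketch under stated assumptions: `check_corrections` is a hypothetical helper, and the plain dictionaries `otype_of` and `original_function` stand in for `F.otype.v` and `F.etcbc4_ft_function.v`; the sample values are invented.

```python
import collections

def check_corrections(corrections, otype_of, original_function):
    """Count (original, corrected) pairs and flag suspicious corrections."""
    pair_counts = collections.Counter()
    errors = collections.Counter()
    seen = set()
    for node, corrected in corrections:
        pair_counts[(original_function.get(node), corrected)] += 1
        if otype_of.get(node) != 'phrase':
            errors['Correction applied to non-phrase object'] += 1
        elif node in seen:
            errors['Phrase with multiple corrections'] += 1
        seen.add(node)
    return pair_counts, errors

# Invented sample: node 620 is corrected twice, node 700 is not a phrase.
sample_pairs = [(610, 'Loca'), (620, 'Benf'), (620, 'Adju'), (700, 'Loca')]
sample_otype = {610: 'phrase', 620: 'phrase', 700: 'clause'}
sample_function = {610: 'Cmpl', 620: 'Adju'}

pair_counts, errors = check_corrections(sample_pairs, sample_otype, sample_function)
for (e, n) in sorted(errors.items(), key=lambda x: (-x[1], x[0])):
    print('{:>5} x: {}'.format(n, e))
```

Checking the raw pairs may catch a phrase that was corrected twice, for instance in the sheets of two different verbs, which is harder to see once the corrections have been stored as a single feature value per node.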
