<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://emdros.org" target="_blank"><img align="left" src="files/images/Emdros-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left"src="images/etcbc4easy-small.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="right" src="images/VU-ETCBC-xsmall.png"/></a>

# Flowchart for verbs: checks 

We carry out checks for the benefit of running verbal valence flow charts.
See the [flowchart](http://nbviewer.ipython.org/github/etcbc/laf-fabric-nbs/blob/master/valence/flowchart.ipynb) notebook.

# Firing up the engines

In [1]:
import sys
import collections

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.3
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: http://shebanq-doc.readthedocs.org/en/latest/texts/welcome.html



# Loading the feature data

In [2]:
version = '4b'
API = fabric.load('etcbc{}'.format(version), 'lexicon', 'valence', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        oid otype monads
        function rela
        g_word_utf8 trailer_utf8
        lex prs sp ls vs vt nametype det gloss
        book chapter verse label number
    ''',
    '''
        mother
    '''),
    "prepare": prepare,
    "primary": False,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: UP TO DATE
  0.10s INFO: USING DATA COMPILED AT: 2015-06-29T05-30-49
  0.10s DETAIL: COMPILING a: UP TO DATE
  0.75s INFO: USING DATA COMPILED AT: 2015-05-04T14-07-34
  0.76s DETAIL: load main: G.node_anchor_min
  0.82s DETAIL: load main: G.node_anchor_max
  0.87s DETAIL: load main: G.node_sort
  0.93s DETAIL: load main: G.node_sort_inv
  1.36s DETAIL: load main: G.edges_from
  1.43s DETAIL: load main: G.edges_to
  1.50s DETAIL: load main: F.etcbc4_db_monads [node] 
  2.28s DETAIL: load main: F.etcbc4_db_oid [node] 
  3.24s DETAIL: load main: F.etcbc4_db_otype [node] 
  4.05s DETAIL: load main: F.etcbc4_ft_det [node] 
  4.31s DETAIL: load main: F.etcbc4_ft_function [node] 
  4.46s DETAIL: load main: F.etcbc4_ft_g_word_utf8 [node] 
  4.83s DETAIL: load main: F.etcbc4_ft_lex [node] 
  5.07s DETAIL: load main: F.etcbc4_ft_ls [node] 
  5.30s DETAIL: load main: F.etcbc4_ft_number [node] 
  5.90s DETAIL: load main: F.etcbc4_f

# Parameters

Here we specify details of the flow chart process such as which lexemes to look for, which senses they have, etc.

In [3]:
predicates = {'Pred', 'PreS', 'PreO', 'PtcO', 'PreC'}

# Determine the target clauses

The target clauses are those clauses that contain the verb NTN in the Qal in a phrase with function ``Pred`` and friends.
But first we do some checks:

* how many verbs clauses can have in predicates
* examine in what phrases and tenses the selected verbs occurs in the Qal;

In [4]:
msg('Examining verbs in predicates in clauses')
verb_dist_clause = collections.defaultdict(lambda: [])
verb_dist_clause_atom = collections.defaultdict(lambda: [])
for p in F.otype.s('phrase'):
    pf = F.function.v(p)
    if pf not in predicates: continue
    c = L.u('clause', p)
    ca = L.u('clause_atom', p)
    for w in L.d('word', p):
        if F.sp.v(w) != 'verb': continue
        verb_dist_clause[c].append(F.lex.v(w))
        verb_dist_clause_atom[ca].append(F.lex.v(w))
msg('Done')
nverbc = len(verb_dist_clause)
nverbca = len(verb_dist_clause_atom)

print('{} clauses have a predicate with a verb'.format(nverbc))
print('{} clause_atoms have a predicate with a verb'.format(nverbca))

multiples_c = 0
onelex_c = 0
for c in verb_dist_clause:
    lexes = verb_dist_clause[c]
    if len(lexes) == 1: continue
    multiples_c += 1
    if len(set(lexes)) == 1: onelex_c += 1
print('{} clauses have multiple verb occurrences of which {} are still single lexeme'.format(
        multiples_c, onelex_c,
))
multiples_ca = 0
onelex_ca = 0
for ca in verb_dist_clause_atom:
    lexes = verb_dist_clause_atom[ca]
    if len(lexes) == 1: continue
    multiples_ca += 1
    if len(set(lexes)) == 1: onelex_ca += 1
print('{} clause_atoms have multiple verb occurrences of which {} are still single lexeme'.format(
        multiples_ca, onelex_ca,
))

 1m 02s Examining verbs in predicates in clauses
 1m 05s Done


69424 clauses have a predicate with a verb
69424 clause_atoms have a predicate with a verb
330 clauses have multiple verb occurrences of which 1 are still single lexeme
330 clause_atoms have multiple verb occurrences of which 1 are still single lexeme


We can stick to clauses, we gain nothing by considering clause_atoms instead.

We need to look more precisely in what happens in clauses with multiple verbs in a predicate.
Are there multiples predicates in those clauses, or are there multiple verbs in the single predicate.

In [5]:
msg('Examining verbs in predicates in clauses (revisited)')

def known_case(vs):
    be = {'HJH[', 'HWH['}
    if len(vs) != 2: return False
    ps = collections.defaultdict(lambda: set())
    for (p,w) in vs: ps[F.function.v(p)].add(F.lex.v(w))
    if len(ps['Pred'] | ps['PreS']) == 1 and len(be & (ps['Pred'] | ps['PreS'])) != 0: return True
    return False

verb_dist = collections.defaultdict(lambda: [])
for p in F.otype.s('phrase'):
    pf = F.function.v(p)
    if pf not in predicates: continue
    c = L.u('clause', p)
    for w in L.d('word', p):
        if F.sp.v(w) != 'verb': continue
        verb_dist[c].append((p, w))
msg('Done')

of = outfile('verb_dist.txt')
multiple = 0
good = 0
for c in verb_dist:
    vs = verb_dist[c]
    if len(vs) == 1: continue
    multiple += 1
    if known_case(vs):
        good += 1
        continue
    of.write('\n{} {}:{}#{}_{}\n'.format(
        F.book.v(L.u('book', c)),
        F.chapter.v(L.u('chapter', c)),
        F.verse.v(L.u('verse', c)),
        F.number.v(L.u('sentence', c)),
        F.number.v(c),
    ))
    for (p,w) in vs:
        of.write('\t{} {} has {}\n'.format(p, F.function.v(p), F.lex.v(w)))
of.close()

msg('''
{:>5} single verb clauses
{:>5} known multiple cases
{:>5} unknown multiple verb clauses
{:>5} clauses in total'''.format(
    len(verb_dist) - multiple,
    good,
    multiple - good,
    len(verb_dist),
))

 1m 40s Examining verbs in predicates in clauses (revisited)
 1m 42s Done
 1m 42s 
69094 single verb clauses
  281 known multiple cases
   49 unknown multiple verb clauses
69424 clauses in total


In [6]:
target_lexemes = {'NTN[', 'FJM[', 'BR>['}
msg('Examining the occurrences of {}'.format(', '.join(sorted(target_lexemes))))
qal_dist_all = collections.Counter()
qal_dist = collections.defaultdict(lambda: collections.Counter())
for w in F.otype.s('word'):
    lex = F.lex.v(w)
    if lex in target_lexemes and F.vs.v(w) == 'qal':
        wt = F.vt.v(w)
        wf = F.function.v(L.u('phrase', w))
        qal_dist_all[(wf,wt)] += 1
        qal_dist[lex][(wf,wt)] += 1
msg('Done')
tot = 0
for (label, n) in sorted(qal_dist_all.items(), key=lambda y: (-y[1], y[0])):
    tot += n
    print('{:<4} {:<4} {:>5} x'.format(label[0], label[1], n))
print('Total     {:>5} x'.format(tot))
for lx in sorted(qal_dist):
    print(lx)
    tot = 0
    for (label, n) in sorted(qal_dist[lx].items(), key=lambda y: (-y[1], y[0])):
        tot += n
        print('     {:<4} {:<4} {:>5} x'.format(label[0], label[1], n))
    print('     Total     {:>5} x'.format(tot))

 2m 51s Examining the occurrences of BR>[, FJM[, NTN[
 2m 54s Done


Pred perf   831 x
Pred impf   471 x
Pred wayq   423 x
Pred infc   159 x
PreO perf   158 x
PreC ptca   131 x
Pred impv   130 x
PreO wayq    85 x
PreO impf    65 x
PreS infc    25 x
PreC ptcp    15 x
PreO infc    14 x
Modi infa     8 x
Pred infa     8 x
PreO impv     6 x
PtcO ptca     4 x
Cmpl ptca     1 x
Objc ptca     1 x
Objc ptcp     1 x
Subj ptca     1 x
Total      2537 x
BR>[
     Pred perf    13 x
     PreC ptca    11 x
     PreO perf     7 x
     Pred wayq     2 x
     Objc ptca     1 x
     Pred impf     1 x
     Pred impv     1 x
     Pred infc     1 x
     Subj ptca     1 x
     Total        38 x
FJM[
     Pred perf   146 x
     Pred wayq   139 x
     Pred impf   104 x
     Pred impv    54 x
     PreO perf    39 x
     Pred infc    31 x
     PreO impf    21 x
     PreC ptca    15 x
     PreO wayq    14 x
     PreS infc     5 x
     PreC ptcp     4 x
     Modi infa     2 x
     PreO impv     2 x
     PreO infc     2 x
     Pred infa     1 x
     Total       579 x
NTN[
     Pred

Decision: we restrict to cases where the verb occurs as *predicate*, i.e. in a phrase with function as mentioned in the parameter ``predicates`` defined above.

## Exploring the object phrases

It is important to know whether a direct object has >T or not.
If it has >T, it is usually at the beginning, although some adverbial words might precede it.
So, what can we expect?
Here we explore everything that may precede the first >T in a phrase with function ``Objc``.

In [7]:
msg('Exploring >T prefixes')
etprefixes = collections.Counter()
etprefix_words = collections.Counter()
for p in F.otype.s('phrase'):
    if F.function.v(p) != 'Objc': continue
    prefix = []
    for w in L.d('word', p):
        if F.lex.v(w) == '>T':
            found = True
            break
        prefix.append(w)
    if found:
        prefstr = '-'.join(F.sp.v(w) for w in prefix)
        etprefixes[prefstr] += 1
        for w in prefix:
            etprefix_words[F.lex.v(w)]+= 1 
msg('Done')
for x in sorted(etprefix_words.items(), key=lambda y: (-y[1], y[0]))[0:10]:
    print('{:<20}: {:>5} x'.format(x[0], x[1]))

 3m 27s Exploring >T prefixes
 3m 30s Done


W                   :  1826 x
H                   :  1371 x
KL/                 :   535 x
BN/                 :   288 x
DBR/                :   258 x
MH                  :   223 x
BJT/                :   213 x
JD/                 :   212 x
>JC/                :   207 x
PNH/                :   198 x


There are many cases where material in a direct object phrase precedes the object marker.
Explore them with
[Dirk Roorda: Material before אֶת](https://shebanq.ancient-data.org/hebrew/query?version=4b&id=878).
For the moment we use a simple criterion: an direct object is an marked object if the object marker occurs *somewhere* in the phrase.

## Exploring the object clauses

We want to know more about object clauses.
Do they have a mother, and if yes, how many, of which type?

In [8]:
msg('Exploring the mothers of Objc-clauses')
noc = 0
nc = 0
mothers = collections.Counter()
has_mothers = collections.Counter()
for c in F.otype.s('clause'):
    nc += 1
    if F.rela.v(c) == 'Objc':
        noc += 1
        nmothers = 0
        for x in C.mother.v(c):
            nmothers += 1
            motype = F.otype.v(x)
            mytype = motype
            if motype == 'phrase':
                mytype = 'phrase {}'.format(F.function.v(x))
            elif motype == 'clause':
                mytype = 'clause {}'.format(F.rela.v(x))
            mothers[mytype] += 1
        has_mothers[nmothers] += 1
msg('Done')
print('{} object clauses of total {}'.format(noc, nc))
totaln = 0
for otp in sorted(mothers):
    thisn = mothers[otp]
    totaln += thisn
    print('{:<16}: {:>4}x'.format(otp, thisn))
print('Total {} mothers'.format(totaln))
totaln = 0
for x in sorted(has_mothers, reverse=True):
    thisn = has_mothers[x]
    if x != 0: totaln += thisn
    print('# mothers = {:>2} for {:>4} object clauses'.format(x, thisn))
print('Total {} object clauses with a mother'.format(totaln))

 3m 38s Exploring the mothers of Objc-clauses
 3m 40s Done


1427 object clauses of total 87900
clause Adju     :   92x
clause Attr     :   25x
clause Cmpl     :    5x
clause CoVo     :   17x
clause Coor     :   30x
clause NA       : 1234x
clause Objc     :   11x
clause PreC     :    2x
clause Resu     :    9x
clause Subj     :    2x
Total 1427 mothers
# mothers =  1 for 1427 object clauses
Total 1427 object clauses with a mother


So, every object clause has exactly one mother, and it is always a clause.
I presume that the object clause acts as a direct object of the verb in the mother clause.

# Complement promotion

We want to promote some complements to (secundary) direct objects, in cases as

*I make you into **a great people** *.

The question is, can we do this rule-based, or do we have to manually add this information to the data we are working with?

Let us generate a list of all cases.

We are looking for a clause and a phrase in it with function ``Cmpl``.
The phrase should start with one of the lexemes ``K``, ``L``, not carrying a pronominal suffix.

There should be an other object in the sentence, but the object could be implied.
So, for the moment we do not make a restriction of it, but we highlight the presence of objects, relatives, and phrases starting with the lexeme ``MN``.

In [18]:
msg('Investigating promotion')
predicates = {'Pred', 'PreS', 'PreO', 'PtcO', 'PreC'}
objectsf = {'Objc', 'PreO', 'PtcO'}
no_prs = {'absent', 'n/a'}
prom_preps = {'K', 'L'}

of = outfile('promotion.csv')
fields = '''book chapter verse sentence clause verbs #objs #cands'''.strip().split()
of_fmt = '{}'+('\t{}' * (len(fields)-1))+'\n'
of.write(of_fmt.format(*fields))

ncands = collections.Counter()
nobjs = collections.Counter()
nclauses = 0

for c in F.otype.s('clause'):
    nclauses += 1
    verbs = []
    cws = L.d('word', c)
    cw1 = cws[0]
    for w in cws:
        if F.sp.v(w) == 'verb':
            verbs.append(F.lex.v(w))
    cands = []
    ps = L.d('phrase', c)
    for p in ps:
        if F.function.v(p) != 'Cmpl': continue
        ws = L.d('word', p)
        w_one = ws[0]
        w_lex = F.lex.v(w_one)
        w_prs = F.prs.v(w_one)
        if w_prs not in no_prs: continue
        if w_lex not in prom_preps: continue
        cands.append(p)
    nc = len(cands)
    if nc == 0: continue

    ncands[nc] += 1

    objects = []
    for p in ps:
        if F.function.v(p) in objectsf:
            objects.append(p)
    no = len(objects)
    if no != 0:
        nobjs[no] += 1

    of.write(of_fmt.format(
        F.book.v(L.u('book', cw1)),
        F.chapter.v(L.u('chapter', cw1)),
        F.verse.v(L.u('verse', cw1)),
        F.number.v(L.u('sentence', cw1)),
        F.number.v(L.u('clause', cw1)),
        ' '.join(verbs),
        no,
        nc,
    ))
of.close()
msg('Done')
print('{:<40}: {:>6}'.format('Total clauses', nclauses))
print('{:<40}: {:>6}'.format('with any candidates', sum(ncands.values())))
for nc in sorted(ncands, reverse=True):
    print('{:<40}: {:>6}'.format('with {:>2} candidates'.format(nc), ncands[nc]))
print('{:<40}: {:>6}'.format('with any objects', sum(nobjs.values())))
for no in sorted(nobjs, reverse=True):
    print('{:<40}: {:>6}'.format('with {:>2} objects'.format(no), nobjs[no]))

 1h 07m 49s Investigating promotion
 1h 07m 52s Done


Total clauses                           :  87900
with any candidates                     :   5141
with  3 candidates                      :      1
with  2 candidates                      :     41
with  1 candidates                      :   5099
with any objects                        :   1780
with  2 objects                         :     61
with  1 objects                         :   1719


# Implied objects

# The relativum

How many occurrences of >CR are there?

What do we know about their antecedents?

In [10]:
msg('Investigating relative clauses')
ashers = list(F.lex.s('>CR'))
msg('The word >CR occurs {} times'.format(len(ashers)))

aclauses = collections.OrderedDict()
multiples = collections.Counter()
for a in ashers:
    c = L.u('clause', a)
    m = list(C.mother.v(c))
    if m:
        if c in aclauses:
            multiples[c]  += 1
        else:
            aclauses[c] = m[0]

msg('There are {} ashers with a mother; {} have multiple mothers'.format(len(aclauses), len(multiples)))         
for (c, n) in sorted(multiples.items(), key=lambda y: -y[1]):
    print('Clause {} has {} ashers'.format(c, n))

mothertypes = collections.defaultdict(lambda: [])
for c in aclauses:
    mothertypes[F.otype.v(aclauses[c])].append(c)
    
for (mt, ms) in sorted(mothertypes.items(), key=lambda y: -len(y[1])):
    print('{:>5} clauses have mother of type {}'.format(len(ms), mt))

of = outfile('asher.txt')
for mtype in mothertypes:
    of.write('[{}]\n'.format(mtype))
    for c in mothertypes[mtype]:
        m = aclauses[c]
        pair = sorted([c, m], key=NK)
        w = L.d('word', c)[0]
        of.write('{:<20} {:>3}:{:>3}#{:>2}_{:>2} {:<6} {}\n'.format(
            F.book.v(L.u('book', w)),
            F.chapter.v(L.u('chapter', w)),
            F.verse.v(L.u('verse', w)),
            F.number.v(L.u('sentence', w)),
            F.number.v(c),
            F.otype.v(m),
            ''.join('C' if x == c else 'M' for x in pair),
        ))
of.close()

 5m 55s Investigating relative clauses
 5m 56s The word >CR occurs 5500 times
 5m 56s There are 5263 ashers with a mother; 0 have multiple mothers


 3513 clauses have mother of type phrase
 1173 clauses have mother of type clause
  577 clauses have mother of type word


# Additional explorations

Below we generate a list of nouns in order to spot the ones that refer to people.
A list of the spotted nouns is used above in the algorithm to separate locative complements from indirect objects.

We also need lists of lexemes that have something locative in them, or that refer to body parts.

In [3]:
cands = collections.defaultdict(lambda: collections.Counter())
verbs = collections.Counter()
glosses = {}
for w in F.otype.s('word'):
    ls = F.ls.v(w)
    sp = F.sp.v(w)
    if sp == 'subs':
        cands[ls][F.lex.v(w)] += 1
    elif sp == 'verb':
        verbs[F.lex.v(w)] += 1
    glosses[F.lex.v(w)] = F.gloss.v(w)

of = outfile('words.txt')
for ls in cands:
    of.write('[{}]\n'.format(ls))
    for cand in sorted(cands[ls]):
        of.write('{:<10} : {:>5} x {}\n'.format(cand, cands[ls][cand], glosses[cand]))
    of.write('\n')
of.close()
of = outfile('verbs.txt')
for vb in sorted(verbs):
    of.write('{:<10} : {:>5} x {}\n'.format(vb, verbs[vb], glosses[vb]))
of.write('\n')
of.close()