<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="right" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="right" src="images/etcbc4easy-small.png"/></a>

# Complement Collection

This notebook collects the complements to the verb in each clause.

The purpose is to create a spreadsheet in which each row corresponds to a clause.
The first column holds the lexeme of the verb in the verb phrase of the clause; the following columns hold the various complements of the verb phrase in that clause.

<img align="right" src="images/Complements.png"/>

# Firing up the engines

In [1]:
import sys
import collections

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.lib import Transcription, monad_set
from etcbc.trees import Tree

fabric = LafFabric()
tr = Transcription()

  0.00s This is LAF-Fabric 4.4.6
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: http://shebanq-doc.readthedocs.org/en/latest/texts/welcome.html



In [5]:
API = fabric.load('etcbc4b', '--', 'clausecomplements', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        oid otype monads
        book chapter verse
        det sp lex function
        g_cons g_word trailer_utf8
    ''','''
    '''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.01s INFO: USING DATA COMPILED AT: 2014-10-23T15-58-52
  7.89s INFO: DATA LOADED FROM SOURCE etcbc4s AND ANNOX -- FOR TASK clausecomplements AT 2015-02-18T13-14-44


# Tree Construction

We construct the trees for each clause, and we only put clause, phrase and word nodes in the trees.
After the construction, each clause has phrase children, which are all phrases that are contained (as monad set) in the clause, and likewise every phrase has word children, which are all words contained in that phrase.

We do not have phrases inside phrases. All phrases occur at the same level, and they are ordered by the canonical ordering: of two phrases, the one that contains the first monad not contained in the other comes first.
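
The canonical ordering described above can be sketched as a comparison of monad sets. This is a minimal illustration, not the code that `etcbc.trees` uses; the function `canonical_cmp` and the phrase names `p1` to `p3` are made up for the example.

```python
from functools import cmp_to_key

def canonical_cmp(a, b):
    """Compare two monad sets: the one that holds the smallest monad
    missing from the other set comes first."""
    diff = a ^ b                      # monads that belong to exactly one set
    if not diff:
        return 0                      # identical sets
    return -1 if min(diff) in a else 1

# Hypothetical phrases, each given as a set of monad (word position) numbers.
toy_phrases = {
    'p1': frozenset({4, 5}),
    'p2': frozenset({1, 2, 3}),
    'p3': frozenset({6}),
}
ordered = sorted(
    toy_phrases,
    key=cmp_to_key(lambda x, y: canonical_cmp(toy_phrases[x], toy_phrases[y])),
)
print(ordered)
```

With this rule an object that embeds another one and starts at the same monad sorts before the embedded object, which is why a clause precedes its own phrases.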

In [6]:
tree_types = ('clause', 'phrase', 'word')
(root_type, leaf_type, clause_type) = (tree_types[0], tree_types[-1], 'clause')

tree = Tree(API, otypes=tree_types, 
    clause_type=clause_type,
    ccr_feature=None,
    pt_feature=None,
    pos_feature='sp',
    mother_feature=None,
)
results = tree.relations()
parent = results['eparent']
children = results['echildren']
msg("Ready for processing")

  0.00s LOADING API with EXTRAs: please wait ... 
  0.02s INFO: USING DATA COMPILED AT: 2014-10-23T15-58-52
  1.37s INFO: DATA LOADED FROM SOURCE etcbc4s AND ANNOX -- FOR TASK clausecomplements AT 2015-02-18T13-15-18
  0.00s Start computing parent and children relations for objects of type clause, phrase, word
  1.32s 100000 nodes
  2.71s 200000 nodes
  4.02s 300000 nodes
  5.42s 400000 nodes
  6.71s 500000 nodes
  7.99s 600000 nodes
  9.41s 700000 nodes
    10s 768200 nodes: 680207 have parents and 341633 have children
    10s Ready for processing


# Make a passage index for the clauses

In [7]:
msg("Making passage index ...")
cur_book = None
cur_chapter = None
cur_verse = None
clause_passage = {}
for n in NN():
    otype = F.otype.v(n)
    if otype == 'book': cur_book = F.book.v(n)
    elif otype == 'chapter': cur_chapter = F.chapter.v(n)
    elif otype == 'verse': cur_verse = F.verse.v(n)
    elif otype == 'clause': clause_passage[n] = (cur_book, cur_chapter, cur_verse)
nclauses = len(clause_passage)
clause_order = sorted(clause_passage)
msg("Passage index created for {} clauses".format(nclauses))

    15s Making passage index ...
    17s Passage index created for 87993 clauses
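
The cell above walks all nodes in order and remembers the most recent book, chapter and verse, so it assumes that these section nodes are visited before the clauses they contain. A small sketch of the same idea on made-up data (the node numbers and the list `toy_nodes` are hypothetical):

```python
# Hypothetical nodes in walking order: (node id, object type, label).
toy_nodes = [
    (1, 'book', 'Genesis'),
    (2, 'chapter', 1),
    (3, 'verse', 1),
    (4, 'clause', None),
    (5, 'clause', None),
    (6, 'verse', 2),
    (7, 'clause', None),
]

current = {'book': None, 'chapter': None, 'verse': None}
toy_passage = {}
for (node, otype, label) in toy_nodes:
    if otype in current:
        current[otype] = label        # remember the enclosing section
    elif otype == 'clause':
        toy_passage[node] = (current['book'], current['chapter'], current['verse'])

for node in sorted(toy_passage):
    print(node, '{} {}:{}'.format(*toy_passage[node]))
```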


# Make an index of the transcriptions of clauses and phrases

We store the transcribed texts for words, phrases and clauses.
In phrases, we separate the first word from the rest by means of a `%` instead of a space.
This makes it easier to implement some logic based on the first word of a phrase.
We do this in the consonantal transcriptions only.
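
As an illustration of why the `%` separator is convenient, the first word of a phrase can be split off with a single call. The helper `split_first_word` and the sample strings are made up for this sketch; a one-word phrase carries no `%`, because the notebook strips a trailing separator.

```python
def split_first_word(phrase_text):
    """Split a consonantal phrase text into (first word, remaining words)."""
    first, _, rest = phrase_text.partition('%')
    return first, rest

print(split_first_word('>T%H CMJM'))   # phrase of several words
print(split_first_word('BR>'))         # one-word phrase, no separator present
```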

In [8]:
msg("Making transcription index ...")
node_transcr = {}
node_transcr_c = {}

for clause in clause_order:
    clause_transcr = ''
    clause_transcr_c = ''
    for phrase in children[clause]:
        phrase_transcr = ''
        phrase_transcr_c = ''
        phword_sep = '%'
        for word in children[phrase]:
            word_transcr = F.g_word.v(word) + tr.from_hebrew(F.trailer_utf8.v(word)).replace('_',' ').replace('\n',' ')
            word_transcr_c = F.g_cons.v(word)
            node_transcr[word] = word_transcr.rstrip(' ')
            node_transcr_c[word] = word_transcr_c
            phrase_transcr += word_transcr
            phrase_transcr_c += word_transcr_c + phword_sep
            phword_sep = ' '
            clause_transcr += word_transcr
            clause_transcr_c += word_transcr_c + ' '
        node_transcr[phrase] = phrase_transcr.rstrip(' ')
        node_transcr_c[phrase] = phrase_transcr_c.rstrip(' %')
    node_transcr[clause] = clause_transcr.rstrip(' ')
    node_transcr_c[clause] = clause_transcr_c.rstrip(' ')
msg("Transcription index created for {} nodes".format(len(node_transcr)))

    23s Making transcription index ...
    28s Transcription index created for 768200 nodes


# Explore phrase features

We are interested in the phrase function and the phrase determination.

In [9]:
msg("Exploring phrase features ...")
phrase_functions = collections.defaultdict(lambda: 0)
phrase_det = collections.defaultdict(lambda: set())
for p in NN():
    otype = F.otype.v(p)
    if otype == 'phrase':
        phrase_functions[F.function.v(p)] += 1
        phrase_det[node_transcr_c[p]].add('d' if F.det.v(p) == 'det' else 'u' if F.det.v(p) == 'und' else 'n')

for value in sorted(phrase_functions):
    print("{:<20} {:>6d} x".format(value, phrase_functions[value]))

phrase_det_code = {}
for (p, dets) in phrase_det.items():
    d = len(dets)
    phrase_det_code[p] = '{};"{}"'.format(d, ''.join(sorted(dets)))
of = outfile('phrase_det.csv')
for p in sorted(phrase_det_code):
    dets = phrase_det_code[p]
    of.write('"{}";{}\n'.format(p, dets))
of.close()
msg("End exploring phrase features")

    33s Exploring phrase features ...
    36s End exploring phrase features


Adju                   9700 x
Cmpl                  29602 x
Conj                  46168 x
EPPr                      4 x
ExsS                     14 x
Exst                    144 x
Frnt                   1069 x
IntS                    250 x
Intj                   1625 x
Loca                   2811 x
ModS                     36 x
Modi                   3823 x
NCoS                    101 x
NCop                    605 x
Nega                   6054 x
Objc                  22211 x
PrAd                    138 x
PrcS                      8 x
PreC                  18779 x
PreO                   5509 x
PreS                    780 x
Pred                  57055 x
PtcO                    162 x
Ques                   1268 x
Rela                   6379 x
Subj                  31031 x
Supp                    298 x
Time                   3787 x
Unkn                   2639 x
Voct                   1590 x
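
The file `phrase_det.csv` written by the cell above records, for each distinct consonantal phrase text, how many different determination codes it occurs with and which ones (`d` for `det`, `u` for `und`, `n` for any other value). A sketch of that encoding on made-up observations (the list `observations` is hypothetical, and `None` merely stands for a value that is neither `det` nor `und`):

```python
import collections

# Hypothetical (phrase text, det value) pairs.
observations = [
    ('H%MLK', 'det'),
    ('MLK', 'und'),
    ('MLK', 'det'),
    ('W', None),
]

toy_det = collections.defaultdict(set)
for (text, det) in observations:
    toy_det[text].add('d' if det == 'det' else 'u' if det == 'und' else 'n')

# One row per phrase text: text; number of distinct codes; the codes themselves.
toy_rows = [
    '"{}";{};"{}"'.format(text, len(codes), ''.join(sorted(codes)))
    for (text, codes) in sorted(toy_det.items())
]
for row in toy_rows:
    print(row)
```

A count above 1 marks a phrase text whose determination differs between occurrences.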


We are also interested in phrases with a particular function.

# Pick up the complement(s)

Also collect the lexemes of the verbs in the verb phrases.
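
The verb lexemes are cleaned with `rstrip('[/=')`, which removes any run of the characters `[`, `/` and `=` from the end of the string (it treats its argument as a set of characters, not as a fixed suffix). A short sketch; the helper `strip_lex` and the sample lexeme strings are made up for illustration and are not taken from the database:

```python
def strip_lex(lex):
    """Remove trailing '[', '/' and '=' marker characters from a lexeme."""
    return lex.rstrip('[/=')

for lex in ('BR>[', '>LHJM/', 'XYZ=['):
    print(lex, '->', strip_lex(lex))
```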

In [10]:
msg("Picking up complements ...")
comptype_spec = '''
predicate    : Pred PreO PreS PtcO PreC
subject      : Subj PreS ExsS IntS ModS NCoS
object       : Objc PreO PtcO
complement   : Cmpl
adjunct      : Adju PrAd Supp Modi
situation    : Loca Time
rest         : PrcS
unclassified : *
'''

comptype_order = []
comptype_inv = {}
comptype = {}

lines = comptype_spec.split('\n')
l = 0
for line in lines:
    if line.strip() == '': continue
    (ctstr, funcstr) = line.split(':')
    ctype = ctstr.strip()
    functions = funcstr.strip().split()
    comptype_order.append(ctype)
    comptype_inv[ctype] = l
    for func in functions:
        comptype[func] = l
    l += 1

vp_lexemes = {}
clause_phrases = collections.defaultdict(lambda: collections.defaultdict(lambda: []))

for clause in clause_order:
    phrases = children[clause]
    verbs = []
    for p in phrases:
        cpt = comptype.get(F.function.v(p), comptype_inv['unclassified'])
        clause_phrases[cpt][clause].append(p)
        if cpt == comptype_inv['predicate']:
            for word in children[p]:
                if F.sp.v(word) == 'verb':
                    verbs.append(F.lex.v(word).rstrip('[/='))
    vp_lexemes[clause] = ' '.join(verbs)


phrase_distribution = collections.defaultdict(lambda: collections.defaultdict(lambda:0))
for ctp in sorted(clause_phrases):
    for clause in clause_phrases[ctp]:
        phrase_distribution[ctp][len(clause_phrases[ctp][clause])] += 1
    
msg("{} clauses".format(
    nclauses, 
))

maxnphrases = {}
for ctp in sorted(phrase_distribution):
    for n in sorted(phrase_distribution[ctp]):
        maxnphrases[ctp] = n
        print("{:>5} clauses with {:>2} {}{}".format(
            phrase_distribution[ctp][n], 
            n, 
            comptype_order[ctp], 
            's' if n != 1 else '',
        ))
for ctp in sorted(maxnphrases):
    print("There are at most {} {}s in a clause".format(maxnphrases[ctp], comptype_order[ctp]))

    44s Picking up complements ...
    45s 87993 clauses


71638 clauses with  1 predicate
 2098 clauses with  2 predicates
32212 clauses with  1 subject
25789 clauses with  1 object
 1045 clauses with  2 objects
    1 clauses with  3 objects
27974 clauses with  1 complement
  799 clauses with  2 complements
   10 clauses with  3 complements
12015 clauses with  1 adjunct
  850 clauses with  2 adjuncts
   69 clauses with  3 adjuncts
    8 clauses with  4 adjuncts
    1 clauses with  5 adjuncts
 5776 clauses with  1 situation
  336 clauses with  2 situations
   43 clauses with  3 situations
    4 clauses with  4 situations
    1 clauses with  5 situations
    8 clauses with  1 rest
51815 clauses with  1 unclassified
 6573 clauses with  2 unclassifieds
  498 clauses with  3 unclassifieds
  140 clauses with  4 unclassifieds
   53 clauses with  5 unclassifieds
   17 clauses with  6 unclassifieds
   16 clauses with  7 unclassifieds
    4 clauses with  8 unclassifieds
    1 clauses with  9 unclassifieds
    1 clauses with 10 unclassifieds
There are a
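
Note that `comptype_spec` lists some functions under more than one type. The parsing loop stores each function in a plain dictionary, so the last line that mentions a function determines its type. The sketch below repeats that logic on the first three lines of the specification; the names `spec` and `type_of` are made up, and type names are stored instead of index numbers for readability.

```python
spec = '''
predicate    : Pred PreO PreS PtcO PreC
subject      : Subj PreS ExsS IntS ModS NCoS
object       : Objc PreO PtcO
'''

type_of = {}
for line in spec.strip().split('\n'):
    (name, funcs) = line.split(':')
    for func in funcs.split():
        type_of[func] = name.strip()   # a later line overwrites an earlier one

predicates = sorted(f for (f, t) in type_of.items() if t == 'predicate')
print(predicates)
print(type_of['PreS'], type_of['PreO'], type_of['PtcO'])
```

So `PreS` phrases are counted as subjects, and `PreO` and `PtcO` phrases as objects; only `Pred` and `PreC` end up as predicates.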

# Print the data

We output everything as a semicolon-separated file.
If a complement type occurs multiple times in a clause, each occurrence occupies its own column.
We know the maximum number of times that each complement type actually occurs, so we reserve that many columns per type.

There are also statistics columns, which contain the number of occurrences of each complement type in that clause.
They are handy for sorting.
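
The column layout can be sketched on a toy example: first one count column per type, then as many text columns per type as the maximum requires, padded with empty cells. All names and values below (`toy_types`, `toy_max`, `toy_clause`) are hypothetical, and quoting and the passage, verb and clause columns are left out for brevity.

```python
toy_types = ['predicate', 'object']
toy_max = {'predicate': 1, 'object': 2}          # assumed maxima per type
toy_clause = {'predicate': ['BR>'], 'object': ['>T%H CMJM']}

# Header: count columns first, then numbered text columns per type.
header = ['#' + t for t in toy_types]
header += ['{}{}'.format(t, i + 1) for t in toy_types for i in range(toy_max[t])]

# Row: the counts, then the phrases of each type, padded to the maximum.
row = [str(len(toy_clause[t])) for t in toy_types]
for t in toy_types:
    found = toy_clause[t]
    row += [found[i] if i < len(found) else '' for i in range(toy_max[t])]

print(';'.join(header))
print(';'.join(row))
```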

In [11]:
sep = ';'
of = outfile('clauses.csv')
header = '"passage"' + sep + '"verblex"' + sep
for cpt in range(len(comptype_order) - 1):
    header += '"#{}"{}'.format(comptype_order[cpt], sep)
for cpt in range(len(comptype_order) - 1):
    for i in range(maxnphrases[cpt]):
        header += '"{}{}"{}'.format(comptype_order[cpt], i + 1, sep)
header += '"clause"\n'
of.write(header)

for clause in clause_order:
    passage = '{} {}:{}'.format(*clause_passage[clause])
    verb_lex = vp_lexemes[clause]
    complements = []
    stats = []
    for cpt in range(len(comptype_order) - 1):
        phrases = clause_phrases[cpt][clause]
        nphrases = len(phrases)
        stats.append(str(nphrases))
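        for i in range(maxnphrases[cpt]):
            if i < nphrases:
                # prefix each phrase with 1 (determined) or 0 (anything else)
                detstr = '1' if F.det.v(phrases[i]) == 'det' else '0'
                complements.append(detstr + ' ' + node_transcr_c[phrases[i]])
            else:
                # pad with empty cells up to the maximum for this type
                complements.append('')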

    of.write(('"{}"' + sep + '"{}"' + sep + '{}' + sep + '"{}"' + sep + '"{}"\n').format(
        passage, 
        verb_lex, 
        sep.join(stats),
        ('"' + sep + '"').join(complements),
        node_transcr_c[clause],
    ))
of.close()
msg("{} clauses".format(nclauses))

 1m 06s 87993 clauses
