<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>


# Corrections and enrichment

In order to do
[verbal valence analysis](flowchart.ipynb)
on verbs, we need to correct some coding errors.

We also need to enrich the constituents surrounding the
verb occurrences with higher-level features that can be used
as input for the flowchart decisions.

Read more in the [wiki](https://github.com/ETCBC/valence/wiki/Workflows).

# Pipeline
See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation)
for how to run this script in the pipeline.

This notebook processes Excel sheets with manual corrections and enrichments.
These have been entered against the `4b` version.
However, the `4b` version in this repository has been regenerated from scratch,
and in that process the node numbers have changed.
As the sheets rely on node numbers to let the entered data flow back to the right nodes,
these sheets no longer work on this version.
It should be possible to identify the material in those sheets on the basis of
book, chapter and verse info.
But we leave that as an exercise to posterity.
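
A minimal sketch of what such a re-identification could look like, assuming that every sheet row can be reduced to a passage label plus the transliterated phrase text, and that the same pair can be computed for the phrases in the regenerated data. The function name `rekey_corrections`, the tuple layouts and the node numbers are hypothetical and not part of this notebook.

```python
def rekey_corrections(old_rows, new_phrases):
    """Carry corrections over to new node numbers via passage and phrase text.

    old_rows:    iterable of (passage, phrase_text, correction)
    new_phrases: iterable of (passage, phrase_text, new_node)
    Returns the corrections keyed by new node, plus the rows that could
    not be matched to exactly one new phrase.
    """
    index = {}
    for (passage, text, node) in new_phrases:
        index.setdefault((passage, text), []).append(node)

    rekeyed = {}
    unresolved = []
    for (passage, text, correction) in old_rows:
        candidates = index.get((passage, text), [])
        if len(candidates) == 1:
            rekeyed[candidates[0]] = correction
        else:
            # no match, or the same text occurs more than once in the verse
            unresolved.append((passage, text, correction))
    return rekeyed, unresolved


# made-up rows and node numbers, for illustration only
old_rows = [("Genesis 1:1", "BR>", "Pred"), ("Genesis 1:2", "HJTH", "PreC")]
new_phrases = [("Genesis 1:1", "BR>", 700001), ("Genesis 1:1", ">LHJM", 700002)]
rekeyed, unresolved = rekey_corrections(old_rows, new_phrases)
```

Rows that end up in `unresolved` would still need manual inspection.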


For all other versions, we keep the mechanism in place, but for now we work with zero manual input
for those versions.

As far as *corrections* are concerned: we expect to see them turn up in the continuous version `c`
of the core [BHSA](https://github.com/ETCBC/bhsa) data.

As far as *enrichments* are concerned: there are very few manual enrichments.
Most of the cases are handled by the algorithm in the notebook.

We recommend harvesting exceptions in the notebook itself, since it already has a mechanism to apply
verb-specific logic.

In [None]:
import sys
import os
import collections
from copy import deepcopy
import utils
from tf.fabric import Fabric

In [None]:
if "SCRIPT" not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = "bhsa"
    NAME = "valence"
    VERSION = "2021"

In [1]:
def stop(good=False):
    if SCRIPT:
        sys.exit(0 if good else 1)

## Authors

[Janet Dyk and Dirk Roorda](https://github.com/ETCBC/valence/wiki/Authors)

Last modified 2017-09-13.

## References

[References](https://github.com/ETCBC/valence/wiki/References)

## Data
We have carried out the valence project against the Hebrew Text Database BHSA, version `4b`.
See the description of the [sources](https://github.com/ETCBC/valence/wiki/Sources).

However, we can also run this workflow against the newer versions.

## Implementation

Start the engines. We use the Python package
[text-fabric](https://github.com/Dans-labs/text-fabric)
to process the data of the Hebrew Text Database smoothly and efficiently.

# Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and
results are in convenient places and do not have to be shifted around.

In [None]:
repoBase = os.path.expanduser("~/github/etcbc")
coreRepo = "{}/{}".format(repoBase, CORE_NAME)
thisRepo = "{}/{}".format(repoBase, NAME)

In [None]:
coreTf = "{}/tf/{}".format(coreRepo, VERSION)

In [None]:
thisSource = "{}/source/{}".format(thisRepo, VERSION)
thisTemp = "{}/_temp/{}".format(thisRepo, VERSION)
thisTempTf = "{}/tf".format(thisTemp)

In [3]:
thisTf = "{}/tf/{}".format(thisRepo, VERSION)

# Test

Check whether this conversion is needed in the first place.
Only when run as a script.
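
The implementation of `utils.mustRun` is not shown in this notebook. The sketch below is only a guess at the kind of check involved, assuming it compares the modification times of a source and a target and returns the pair `(good, work)` used in the next cell. The function `must_run` is a hypothetical stand-in, not the real helper.

```python
import os


def must_run(source, target, force=False):
    """Hypothetical stand-in for utils.mustRun: returns (good, work).

    good: False if the source is given but missing
    work: True if the target has to be (re)generated
    """
    if force:
        return (True, True)
    if source is not None and not os.path.exists(source):
        return (False, False)
    if not os.path.exists(target):
        return (True, True)
    if source is None:
        # nothing to compare against: an existing target counts as up to date
        return (True, False)
    return (True, os.path.getmtime(source) > os.path.getmtime(target))
```

With `FORCE = True`, as set at the top of this notebook, such a check always asks for the work to be done.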

In [4]:
if SCRIPT:
    (good, work) = utils.mustRun(
        None, "{}/.tf/{}.tfx".format(thisTf, "valence"), force=FORCE
    )
    print(good, work)
    if not good:
        stop(good=False)
    if not work:
        stop(good=True)

# Loading the feature data

We load the features we need from the BHSA core database.

In [5]:
utils.caption(4, "Load the existing TF dataset")
TF = Fabric(locations=coreTf, modules=[""])

..............................................................................................
.       0.00s Load the existing TF dataset                                                   .
..............................................................................................
This is Text-Fabric 8.5.13
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

88 features found and 0 ignored


We instruct the API to load data.

In [6]:
api = TF.load(
    """
    lex gloss lex_utf8
    sp vs lex uvf prs nametype ls
    function rela typ
    mother
"""
)
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  4.86s All features loaded/computed - for details use loadLog()


[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

# Locations

In [None]:
linkShebanq = "https://shebanq.ancient-data.org/hebrew/text"
linkPassage = "?book={}&chapter={}&verse={}"
linkAppearance = "&version={}&mr=m&qw=n&tp=txt_tb1&tr=hb&wget=x&qget=v&nget=x".format(
    VERSION
)

In [None]:
resultDir = "{}/results".format(thisTemp)
allResults = "{}/all.csv".format(resultDir)
selectedResults = "{}/selected.csv".format(resultDir)
kinds = ("corr_blank", "corr_filled", "enrich_blank", "enrich_filled")
kdir = {}
for k in kinds:
    kd = "{}/{}".format(thisSource, k)
    kdir[k] = kd
    os.makedirs(kd, exist_ok=True)
os.makedirs(resultDir, exist_ok=True)

In [7]:
def vfile(verb, kind):
    if kind not in kinds:
        utils.caption(0, "ERROR: Unknown kind `{}`".format(kind))
        return None
    baseName = verb.replace(">", "a").replace("<", "o")
    return (baseName, "{}/{}.csv".format(kdir[kind], baseName))
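
The transliterated lexemes contain `>` and `<`, which are awkward in file names, so `vfile` replaces them by `a` and `o`. A small stand-alone illustration of just that replacement (the helper name `sheet_base_name` is ours, not part of the notebook):

```python
def sheet_base_name(verb):
    # same replacement as in vfile(): ">" becomes "a", "<" becomes "o"
    return verb.replace(">", "a").replace("<", "o")


names = {verb: sheet_base_name(verb) for verb in ("BR>", "<FH", "NTN")}
print(names)
```

This is why the sheets for `BR>` and `<FH` show up as `BRa` and `oFH` in the log output further down.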

# Domain
Here is a subset of verbs that interest us.
In fact, we are interested in all verbs, but we have subjected the occurrences of these verbs to closer inspection,
together with the contexts they occur in.

Manual additions in the correction and enrichment workflow can only happen for selected verbs.

In [None]:
verbs_initial = set(
    """
    CJT
    BR>
    QR>
""".strip().split()
)

In [None]:
motion_verbs = set(
    """
    <BR
    <LH
    BW>
    CWB
    HLK
    JRD
    JY>
    NPL
    NWS
    SWR
""".strip().split()
)

In [None]:
double_object_verbs = set(
    """
    NTN
    <FH
    FJM
""".strip().split()
)

In [None]:
complex_qal_verbs = set(
    """
    NF>
    PQD
""".strip().split()
)

In [8]:
verbs = verbs_initial | motion_verbs | double_object_verbs | complex_qal_verbs

# 1. Correction workflow

## 1.1 Phrase function

We need to correct some values of the phrase function.
When we receive the corrections, we check whether they have legal values.
Here we look up the possible values.

In [9]:
predicate_functions = {
    "Pred",
    "PreS",
    "PreO",
    "PreC",
    "PtcO",
    "PrcS",
}

In [10]:
legal_values = dict(
    function={F.function.v(p) for p in F.otype.s("phrase")},
)

Later we generate a list of occurrences of those verbs, organized by the lexeme of the verb.
First we need some extra values to indicate other coding errors.

In [11]:
error_values = dict(
    function=dict(
        BoundErr="this constituent is part of another constituent and does not merit its own function/type/rela value",
    ),
)

We add the `error_values` to the legal values.

In [12]:
for feature in set(legal_values.keys()) | set(error_values.keys()):
    ev = error_values.get(feature, {})
    if ev:
        lv = legal_values.setdefault(feature, set())
        lv |= set(ev.keys())
if not SCRIPT:
    utils.caption(0, "{}".format(legal_values))

|         37s {'function': {'Frnt', 'Conj', 'Modi', 'BoundErr', 'PrcS', 'Objc', 'Subj', 'NCop', 'Pred', 'PreS', 'Cmpl', 'Intj', 'Nega', 'Adju', 'PreC', 'Supp', 'Ques', 'Loca', 'NCoS', 'PreO', 'IntS', 'Rela', 'ModS', 'PtcO', 'PrAd', 'Exst', 'EPPr', 'Voct', 'ExsS', 'Time'}}


In [13]:
utils.caption(4, "Finding occurrences ...")
occs = collections.defaultdict(
    list
)  # dictionary of verb occurrence nodes per verb lexeme
npoccs = collections.defaultdict(list)  # same, but those not occurring in a "predicate"
clause_verb = collections.defaultdict(
    list
)  # dictionary of verb occurrence nodes per clause node
sel_clause_verb = collections.defaultdict(
    list
)  # dictionary of selected verb occurrence nodes per clause node
clause_verb_index = collections.defaultdict(
    set
)  # mapping from clauses to its main verb(s)
sel_clause_verb_index = collections.defaultdict(
    set
)  # mapping from clauses to its main verb(s), for selected verbs
verb_clause_index = collections.defaultdict(
    list
)  # mapping from verbs to the clauses of which it is main verb
sel_verb_clause_index = collections.defaultdict(
    list
)  # mapping from selected verbs to the clauses of which it is main verb

..............................................................................................
.         46s Finding occurrences ...                                                        .
..............................................................................................
|         48s 	Done
|         48s 	All:      1380 verbs with  73710 verb occurrences in 70150 clauses
|         48s 	Selected:   18 verbs with  16209 verb occurrences in 16052 clauses
|         48s 	<BR   556 occurrences of which   33 outside a predicate phrase
|         48s 	<FH  2629 occurrences of which   59 outside a predicate phrase
|         48s 	<LH   890 occurrences of which   10 outside a predicate phrase
|         48s 	BR>    54 occurrences of which    3 outside a predicate phrase
|         48s 	BW>  2570 occurrences of which   27 outside a predicate phrase
|         48s 	CJT    85 occurrences of which    1 outside a predicate phrase
|         48s 	CWB  1056 occurrences of which   22 outside a predicate phrase

In [None]:
nw = 0
sel_nw = 0
for w in F.otype.s("word"):
    if F.sp.v(w) != "verb":
        continue
    lex = F.lex.v(w).rstrip("[=")
    nw += 1
    pf = F.function.v(L.u(w, "phrase")[0])
    if pf not in predicate_functions:
        npoccs[lex].append(w)
    occs[lex].append(w)
    cn = L.u(w, "clause")[0]
    clause_verb[cn].append(w)
    clause_verb_index[cn].add(lex)
    verb_clause_index[lex].append(cn)
    if lex in verbs:
        sel_nw += 1
        sel_clause_verb[cn].append(w)
        sel_clause_verb_index[cn].add(lex)
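
The loop above keys everything by `F.lex.v(w).rstrip("[=")`. As we understand the BHSA lexeme notation, a trailing `[` marks a verb and trailing `=` signs distinguish homonyms, so stripping them makes the lexemes comparable with the bare forms in `verbs`. A small stand-alone illustration (the helper name `bare_lexeme` and the example values are ours):

```python
def bare_lexeme(lex):
    # drop the verb marker "[" and any homonym markers "=" from the end
    return lex.rstrip("[=")


examples = [bare_lexeme(x) for x in ("NTN[", "QR>[", "QR>=[")]
print(examples)
```

A consequence worth keeping in mind is that homonymous verbs end up under the same key.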

In [None]:
sel_verb_clause_index = dict(
    (lex, cns) for (lex, cns) in verb_clause_index.items() if lex in verbs
)
sel_clause_verb

In [None]:
utils.caption(0, "\tDone")
utils.caption(
    0,
    "\tAll:      {:>4} verbs with {:>6} verb occurrences in {} clauses".format(
        len(verb_clause_index), nw, len(clause_verb)
    ),
)
utils.caption(
    0,
    "\tSelected: {:>4} verbs with {:>6} verb occurrences in {} clauses".format(
        len(sel_verb_clause_index), sel_nw, len(sel_clause_verb)
    ),
)

In [None]:
for verb in sorted(verbs):
    utils.caption(
        0,
        "\t{} {:>5} occurrences of which {:>4} outside a predicate phrase".format(
            verb,
            len(occs[verb]),
            len(npoccs[verb]),
        ),
    )

## 1.2 Blank sheet generation
Generate correction sheets.
They are CSV files with `;` as field separator. Every row corresponds to a verb occurrence; a clause appears only once per sheet.
The fields per row are: the node number of the clause in which the verb occurs, the node number of the verb occurrence, a passage label (book, chapter, verse), a hyperlink to the passage in SHEBANQ, the text of the verb occurrence (in ETCBC transliteration, consonantal), the verbal stem, and then 4 columns for each phrase in the clause:

* phrase node number
* phrase text (ETCBC transliterated consonantal)
* original value of the `function` feature
* corrected value of the `function` feature (generated as empty)
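
To make the layout concrete, here is a stand-alone sketch that builds a header and one row in the column order used by `gen_sheet` below. It works on plain tuples instead of Text-Fabric nodes; the helper names and the node numbers are made up for illustration.

```python
FIXED_FIELDS = ["clause#", "word#", "passage", "link", "verb", "stem"]
PHRASE_SUFFIXES = ("#", "_txt", "_function", "_corr")


def header_for(max_phrases):
    # fixed columns first, then 4 columns per phrase slot, numbered from 1
    names = list(FIXED_FIELDS)
    for i in range(1, max_phrases + 1):
        names.extend(f"phr{i}{suffix}" for suffix in PHRASE_SUFFIXES)
    return names


def row_for(fixed, phrases):
    # phrases: (node, text, function); the correction column starts out empty
    row = list(fixed)
    for (node, text, function) in phrases:
        row.extend((node, text, function, ""))
    return row


header = header_for(2)
row = row_for(
    (427559, 3, "Genesis 1:1", "link", "BR>", "qal"),  # made-up node numbers
    [(651542, "BR>", "Pred"), (651543, ">LHJM", "Subj")],
)
line = ";".join(str(x) for x in row)
print(";".join(header))
print(line)
```

Rows for clauses with fewer phrases than the maximum simply have fewer fields than the header.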

In [None]:
utils.caption(4, "Generating blank correction sheets ...")
sheetKind = "corr_blank"
utils.caption(0, "\tas {}".format(vfile("{verb}", sheetKind)[1]))

In [None]:
phrases_seen = collections.Counter()

In [None]:
def gen_sheet(verb):
    rows = []
    fieldsep = ";"
    field_names = """
        clause#
        word#
        passage
        link
        verb
        stem
    """.strip().split()
    max_phrases = 0
    clauses_seen = set()
    for wn in occs[verb]:
        cln = L.u(wn, "clause")[0]
        if cln in clauses_seen:
            continue
        clauses_seen.add(cln)
        vn = L.u(wn, "verse")[0]
        (bookName, ch, vs) = T.sectionFromNode(vn, lang="la")
        passage_label = "{} {}:{}".format(*T.sectionFromNode(vn))
        ln = linkShebanq + (linkPassage.format(bookName, ch, vs)) + linkAppearance
        lnx = '''"=HYPERLINK(""{}""; ""link"")"'''.format(ln)
        vt = T.text([wn], fmt="text-trans-plain")
        vstem = F.vs.v(wn)
        np = "* " if wn in npoccs[verb] else ""
        row = [cln, wn, passage_label, lnx, np + vt, vstem]
        phrases = L.d(cln, "phrase")
        n_phrases = len(phrases)
        if n_phrases > max_phrases:
            max_phrases = n_phrases
        for pn in phrases:
            phrases_seen[pn] += 1
            pt = T.text(L.d(pn, "word"), fmt="text-trans-plain")
            pf = F.function.v(pn)
            pnp = np if pf in predicate_functions else ""
            row.extend((pn, pnp + pt, pf, ""))
        rows.append(row)
    for i in range(max_phrases):
        field_names.extend(
            """
            phr{i}#
            phr{i}_txt
            phr{i}_function
            phr{i}_corr
        """.format(
                i=i + 1
            )
            .strip()
            .split()
        )
    location = vfile(verb, sheetKind)
    if location is None:
        return
    (baseName, fileName) = location
    with open(fileName, "w") as row_file:
        row_file.write("{}\n".format(fieldsep.join(field_names)))
        for row in rows:
            row_file.write("{}\n".format(fieldsep.join(str(x) for x in row)))
    utils.caption(0, "\t\tfor verb {}".format(baseName))

In [None]:
for verb in verbs:
    gen_sheet(verb)

In [14]:
stats = collections.Counter()
for (p, times) in phrases_seen.items():
    stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    utils.caption(0, "\t{:<6} phrases seen {:<2} time(s)".format(n, times))
utils.caption(0, "\tTotal phrases seen: {}".format(len(phrases_seen)))

..............................................................................................
.      1m 06s Generating blank correction sheets ...                                         .
..............................................................................................
|      1m 06s 	as /Users/dirk/github/etcbc/valence/source/2021/corr_blank/{verb}.csv
|      1m 06s 		for verb oFH
|      1m 06s 		for verb PQD
|      1m 06s 		for verb BWa
|      1m 07s 		for verb NFa
|      1m 07s 		for verb NPL
|      1m 07s 		for verb QRa
|      1m 07s 		for verb HLK
|      1m 07s 		for verb SWR
|      1m 07s 		for verb FJM
|      1m 07s 		for verb NWS
|      1m 07s 		for verb CWB
|      1m 07s 		for verb oLH
|      1m 08s 		for verb NTN
|      1m 08s 		for verb BRa
|      1m 08s 		for verb CJT
|      1m 08s 		for verb JYa
|      1m 08s 		for verb oBR
|      1m 08s 		for verb JRD
|      1m 08s 	52060  phrases seen 1  time(s)
|      1m 08s 	185    phrases seen 2  time(s)
|      1m 08s 	9

## 1.3 Processing corrections
We read the filled-in correction sheets and extract the correction data from them.
We store the corrections in a dictionary keyed by the phrase node.
We check whether we get multiple corrections for the same phrase.
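
The core of the extraction is picking the phrase node and the correction out of each group of 4 phrase columns. A simplified, stand-alone sketch of that step, without the node checks that `read_corr` performs and with a made-up example row (the function name `corrections_in` is ours):

```python
def corrections_in(line, fieldsep=";"):
    """Return (phrase node, corrected function) pairs found in one sheet row."""
    fields = line.rstrip("\n").split(fieldsep)
    found = []
    # 6 fixed columns, then groups of 4: node, text, function, correction
    for start in range(6, len(fields) - 3, 4):
        node = fields[start]
        correction = fields[start + 3].strip()
        if node != "" and correction != "":
            found.append((int(node), correction))
    return found


# made-up row: the second phrase got a correction, the first did not
example = "427559;3;Genesis 1:1;link;BR>;qal;651542;BR>;Pred;;651543;>LHJM;Subj;Objc\n"
print(corrections_in(example))
```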

In [None]:
utils.caption(4, "Processing filled correction sheets ...")
sheetKind = "corr_filled"
utils.caption(0, "\tas {}".format(vfile("{verb}", sheetKind)[1]))

In [None]:
phrases_seen = collections.Counter()
pf_corr = {}

In [None]:
def read_corr():
    function_values = legal_values["function"]

    for verb in sorted(verbs):
        repeated = collections.defaultdict(list)
        non_phrase = set()
        illegal_fvalue = set()
        nodeNumberErrors = []

        location = vfile(verb, sheetKind)
        if location is None:
            continue
        (baseName, fileName) = location
        if not os.path.exists(fileName):
            utils.caption(0, "\t\tNO file for {}".format(baseName))
            continue
        else:
            utils.caption(0, "\t\tverb {}".format(baseName))
        with open(fileName) as f:
            for (i, line) in enumerate(f):
                fields = line.rstrip().split(";")
                cn = int(fields[0])
                wn = int(fields[1])
                if F.otype.v(cn) != "clause":
                    nodeNumberErrors.append([i, "{} is not a clause node".format(cn)])
                if F.otype.v(wn) != "word":
                    nodeNumberErrors.append([i, "{} is not a word node".format(wn)])
                words = set(L.d(cn, "word"))
                phrases = set(L.d(cn, "phrase"))
                if wn not in words:
                    nodeNumberErrors.append(
                        [i, "{} is not a word of clause {}".format(wn, cn)]
                    )
                # use j for the phrase group, so that i keeps pointing at the line
                for j in range(1, len(fields) // 4):
                    (pn, pc) = (fields[2 + 4 * j], fields[2 + 4 * j + 3])
                    if pn != "":
                        pn = int(pn)
                        if F.otype.v(pn) != "phrase":
                            nodeNumberErrors.append(
                                [i, "{} is not a phrase node".format(pn)]
                            )
                        if pn not in phrases:
                            nodeNumberErrors.append(
                                [i, "{} is not a phrase of clause {}".format(pn, cn)]
                            )
                        pc = pc.strip()
                        phrases_seen[pn] += 1
                        if pc != "":
                            good = True
                            for _ in [1]:
                                good = False
                                if pn in pf_corr:
                                    # collect the whole value, not its characters
                                    repeated[pn].append(pc)
                                    continue
                                if pc not in function_values:
                                    illegal_fvalue.add(pc)
                                    continue
                                good = True
                            if good:
                                pf_corr[pn] = pc

        utils.caption(
            0,
            "\t{}: Found {:>5} corrections in {}".format(verb, len(pf_corr), fileName),
        )
        if len(nodeNumberErrors):
            for (i, msg) in nodeNumberErrors:
                utils.caption(0, "ERROR: Line {:>3}: {}".format(i + 1, msg))
        else:
            utils.caption(0, "\tOK: node numbers in sheet are consistent")
        if len(repeated):
            utils.caption(0, "ERROR: Some phrases have been corrected multiple times!")
            for x in sorted(repeated):
                utils.caption(0, "\t{:>6}: {}".format(x, ", ".join(repeated[x])))
        else:
            utils.caption(
                0, "\tOK: Corrected phrases did not receive multiple corrections"
            )
        if len(non_phrase):
            utils.caption(
                0,
                "ERROR: Corrections have been applied to non-phrase nodes: {}".format(
                    ",".join(non_phrase)
                ),
            )
        else:
            utils.caption(0, "\tOK: all corrected nodes were phrase nodes")
        if len(illegal_fvalue):
            utils.caption(
                0, "ERROR: Some corrections supply illegal values for phrase function!"
            )
            utils.caption(0, "\t`{}`".format("`, `".join(illegal_fvalue)))
        else:
            utils.caption(0, "\tOK: all corrected values are legal")
    utils.caption(
        0, "\tFound {} corrections in the phrase function".format(len(pf_corr))
    )

In [None]:
read_corr()

In [15]:
stats = collections.Counter()
for (p, times) in phrases_seen.items():
    stats[times] += 1
for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
    utils.caption(0, "\t{:<6} phrases seen {:<2} time(s)".format(n, times))
utils.caption(0, "\tTotal phrases seen: {}".format(len(phrases_seen)))

..............................................................................................
.      1m 21s Processing filled correction sheets ...                                        .
..............................................................................................
|      1m 21s 	as /Users/dirk/github/etcbc/valence/source/2021/corr_filled/{verb}.csv
|      1m 21s 		NO file for oBR
|      1m 21s 		NO file for oFH
|      1m 21s 		NO file for oLH
|      1m 21s 		NO file for BRa
|      1m 21s 		NO file for BWa
|      1m 21s 		NO file for CJT
|      1m 21s 		NO file for CWB
|      1m 21s 		NO file for FJM
|      1m 21s 		NO file for HLK
|      1m 21s 		NO file for JRD
|      1m 21s 		NO file for JYa
|      1m 21s 		NO file for NFa
|      1m 21s 		NO file for NPL
|      1m 21s 		NO file for NTN
|      1m 21s 		NO file for NWS
|      1m 21s 		NO file for PQD
|      1m 21s 		NO file for QRa
|      1m 21s 		NO file for SWR
|      1m 21s 	Found 0 corrections in the phrase function

# 2. Enrichment workflow

We create blank sheets for new feature assignments, based on the corrected data.

In [16]:
enrich_field_spec = """
valence
    adjunct
    complement
    core

predication
    NA
    regular
    copula

grammatical
    NA
    subject
    principal_direct_object
    direct_object
    NP_direct_object
    indirect_object
    L_object
    K_object
    infinitive_object
    *

original
    NA
    subject
    principal_direct_object
    direct_object
    NP_direct_object
    indirect_object
    L_object
    K_object
    infinitive_object
    *

lexical
    location
    time

semantic
    benefactive
    time
    location
    instrument
    manner
"""
enrich_fields = collections.OrderedDict()
cur_e = None
for line in enrich_field_spec.strip().split("\n"):
    if line.startswith(" "):
        enrich_fields.setdefault(cur_e, set()).add(line.strip())
    else:
        cur_e = line.strip()
nef = len(enrich_fields)
if None in enrich_fields:
    utils.caption(0, "ERROR: Invalid enrich field specification")
else:
    utils.caption(4, "{} Enrich field specifications OK".format(nef))
for (ef, fields) in sorted(enrich_fields.items()):
    utils.caption(0, "\t{} has possible values".format(ef))
    for field in fields:
        utils.caption(0, "\t\t{}".format(field))

..............................................................................................
.      1m 50s 6 Enrich field specifications OK                                               .
..............................................................................................
|      1m 50s 	grammatical has possible values
|      1m 50s 		direct_object
|      1m 50s 		indirect_object
|      1m 50s 		subject
|      1m 50s 		NP_direct_object
|      1m 50s 		infinitive_object
|      1m 50s 		NA
|      1m 50s 		L_object
|      1m 50s 		*
|      1m 50s 		principal_direct_object
|      1m 50s 		K_object
|      1m 50s 	lexical has possible values
|      1m 50s 		location
|      1m 50s 		time
|      1m 50s 	original has possible values
|      1m 50s 		direct_object
|      1m 50s 		indirect_object
|      1m 50s 		subject
|      1m 50s 		NP_direct_object
|      1m 50s 		infinitive_object
|      1m 50s 		NA
|      1m 50s 		L_object
|      1m 50s 		*
|      1m 50s 		principal_direct_object


In [17]:
enrich_baseline_rules = dict(
    phrase="""Adju	Adjunct	adjunct	NA	NA			
Cmpl	Complement	complement	NA	*			
Conj	Conjunction	NA	NA	NA		NA	NA
EPPr	Enclitic personal pronoun	NA	copula	NA			
ExsS	Existence with subject suffix	core	copula	subject			
Exst	Existence	core	copula	NA			
Frnt	Fronted element	NA	NA	NA		NA	NA
Intj	Interjection	NA	NA	NA		NA	NA
IntS	Interjection with subject suffix	core	NA	subject			
Loca	Locative	adjunct	NA	NA		location	location
Modi	Modifier	NA	NA	NA		NA	NA
ModS	Modifier with subject suffix	core	NA	subject			
NCop	Negative copula	core	copula	NA			
NCoS	Negative copula with subject suffix	core	copula	subject			
Nega	Negation	NA	NA	NA		NA	NA
Objc	Object	complement	NA	direct_object			
PrAd	Predicative adjunct	adjunct	NA	NA			
PrcS	Predicate complement with subject suffix	core	regular	subject			
PreC	Predicate complement	core	regular	NA			
Pred	Predicate	core	regular	NA			
PreO	Predicate with object suffix	core	regular	direct_object			
PreS	Predicate with subject suffix	core	regular	subject			
PtcO	Participle with object suffix	core	regular	direct_object			
Ques	Question	NA	NA	NA		NA	NA
Rela	Relative	NA	NA	NA		NA	NA
Subj	Subject	core	NA	subject			
Supp	Supplementary constituent	adjunct	NA	NA			benefactive
Time	Time reference	adjunct	NA	NA		time	time
Unkn	Unknown	NA	NA	NA		NA	NA
Voct	Vocative	NA	NA	NA		NA	NA""",  # noqa W291
    clause="""Objc	Object	complement	NA	direct_object			
InfC	Infinitive Construct clause	NA	NA				""",  # noqa W291
)

In [None]:
utils.caption(4, "\tChecking enrich baseline rules")
transform = collections.OrderedDict((("phrase", {}), ("clause", {})))
errors = 0
good = 0

In [18]:
for kind in ("phrase", "clause"):
    for line in enrich_baseline_rules[kind].split("\n"):
        x = line.split("\t")
        nefields = len(x) - 2
        if len(x) - 2 != nef:
            utils.caption(
                0,
                "ERROR: Wrong number of fields ({} must be {}) in {}:\n{}".format(
                    nefields, nef, kind, line
                ),
            )
            errors += 1
        transform[kind][x[0]] = dict(zip(enrich_fields, x[2:]))
    for e in error_values["function"]:
        transform[kind][e] = dict(zip(enrich_fields, [""] * nef))

    for f in transform[kind]:
        for e in enrich_fields:
            val = transform[kind][f][e]
            if val != "" and val != "NA" and val not in enrich_fields[e]:
                utils.caption(
                    0,
                    'ERROR: Defaults for `{}` ({}): wrong `{}` value: "{}"'.format(
                        f, kind, e, val
                    ),
                )
                errors += 1
            else:
                good += 1
if errors:
    utils.caption(0, "ERROR: There were {} errors ({} good)".format(errors, good))
else:
    utils.caption(0, "\tEnrich baseline rules are OK ({} good)".format(good))

..............................................................................................
.      2m 04s 	Checking enrich baseline rules                                                .
..............................................................................................
|      2m 04s 	Enrich baseline rules are OK (204 good)
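
Each baseline rule is a tab-separated line: a function code, a description, and then one default value per enrich field, in the order of the field specification. A stand-alone sketch of how one such line turns into a dictionary of defaults (the names `ENRICH_FIELDS` and `parse_rule` are ours; the rule line is the `Loca` rule from above):

```python
ENRICH_FIELDS = [
    "valence",
    "predication",
    "grammatical",
    "original",
    "lexical",
    "semantic",
]


def parse_rule(line):
    parts = line.split("\t")
    code, values = parts[0], parts[2:]  # parts[1] is the description
    if len(values) != len(ENRICH_FIELDS):
        raise ValueError(
            "Wrong number of fields ({} must be {})".format(
                len(values), len(ENRICH_FIELDS)
            )
        )
    return code, dict(zip(ENRICH_FIELDS, values))


code, defaults = parse_rule("Loca\tLocative\tadjunct\tNA\tNA\t\tlocation\tlocation")
print(code, defaults)
```

An empty string means that the rule leaves the field open, which is different from the explicit value `NA`.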


Let us pretty-print the baseline rules of enrichment for easier reference.

In [19]:
if not SCRIPT:
    ltpl = "{:<8}: " + ("{:<15}" * nef)
    utils.caption(0, ltpl.format("func", *enrich_fields), continuation=True)
    for kind in transform:
        utils.caption(0, "[{}]".format(kind), continuation=True)
        for f in sorted(transform[kind]):
            sfs = transform[kind][f]
            utils.caption(
                0, ltpl.format(f, *[sfs[sf] for sf in enrich_fields]), continuation=True
            )

func    : valence        predication    grammatical    original       lexical        semantic       
[phrase]
Adju    : adjunct        NA             NA                                                          
BoundErr:                                                                                           
Cmpl    : complement     NA             *                                                           
Conj    : NA             NA             NA                            NA             NA             
EPPr    : NA             copula         NA                                                          
ExsS    : core           copula         subject                                                     
Exst    : core           copula         NA                                                          
Frnt    : NA             NA             NA                            NA             NA             
IntS    : core           NA             subject                                   

## 2.1 Enrichment logic

We apply enrichment logic to *all* verbs, not only to selected verbs.
But only selected verbs can receive manual enrichment enhancements.

For some verbs, selected or not, additional logic specific to that verb can be specified.

## 2.2 Direct objects

We have to do some work to identify (multiple) direct objects and indirect objects.

[More on direct objects](https://github.com/ETCBC/valence/wiki/Discussion#direct-objects)
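
The choice of the principal direct object, as coded further down, can be summarized in a simplified, stand-alone form. This sketch assumes that the direct objects of a clause are given as plain tuples in text order rather than as Text-Fabric nodes; the function name and the data layout are hypothetical.

```python
def principal_object(objects):
    """Pick the principal direct object of a clause, or None.

    objects: list of (position, function, is_marked) for all direct objects,
    where function is "PreO", "PtcO" or "Objc" for phrases and "clause"
    for object clauses, and is_marked tells whether the object marker occurs.
    """
    if len(objects) <= 1:
        # a single object is not marked as principal
        return None
    ordered = sorted(objects)
    # rule 1: object suffixes on the predicate come first
    suffixed = [o for o in ordered if o[1] in {"PreO", "PtcO"}]
    if suffixed:
        return suffixed[0]
    # rule 2: among object phrases, marked ones win over unmarked ones
    phrases = [o for o in ordered if o[1] == "Objc"]
    marked = [o for o in phrases if o[2]]
    if marked:
        return marked[0]
    if phrases:
        return phrases[0]
    return None


print(principal_object([(1, "Objc", False), (2, "Objc", True)]))
```

Object clauses count towards the number of objects, but are never chosen as principal object themselves.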

In [None]:
objectfuncs = set(
    """
Objc PreO PtcO
""".strip().split()
)

In [None]:
cmpl_as_obj_preps = set(
    """
K L
""".strip().split()
)

In [20]:
no_prs = set(
    """
absent n/a
""".strip().split()
)

In [21]:
body_parts = set(
    """
>NP/ >P/ >PSJM/ >YB</ >ZN/
<JN/ <NQ/ <RP/ <YM/ <YM==/
BHN/ BHWN/ BVN/
CD=/ CD===/ CKM/ CN/
DD/
GRGRT/ GRM/ GRWN/ GW/ GW=/ GWJH/ GWPH/ GXWN/
FPH/
JD/ JRK/ JRKH/
KRF/ KSL=/ KTP/
L</ LCN/ LCWN/ LXJ/
M<H/ MPRQT/ MTL<WT/ MTNJM/ MYX/
NBLH=/
P<M/ PGR/ PH/ PM/ PNH/ PT=/
QRSL/
R>C/ RGL/
XDH/ XLY/ XMC=/ XRY/
YW>R/
ZRW</
""".strip().split()
)
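
These sets feed the test that decides whether a complement counts as an `L` or `K` object: the phrase must start with one of these prepositions, the preposition must not carry a pronominal suffix, and an `L` followed by a body part is excluded. A stand-alone sketch of that test on plain values (the function name and the tiny body part set are ours, for illustration only):

```python
CMPL_AS_OBJ_PREPS = {"K", "L"}
NO_PRS = {"absent", "n/a"}


def complement_object_kind(first_lex, first_prs, second_lex, body_parts):
    """Return "K" or "L" if the complement counts as such an object, else None."""
    if first_lex not in CMPL_AS_OBJ_PREPS:
        return None
    if first_prs not in NO_PRS:
        # preposition with pronominal suffix: not treated as an object
        return None
    if first_lex == "L" and second_lex in body_parts:
        return None
    return first_lex


some_body_parts = {"JD/", "R>C/"}
print(complement_object_kind("L", "absent", "MLK/", some_body_parts))
print(complement_object_kind("L", "absent", "JD/", some_body_parts))
```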

In [None]:
utils.caption(4, "Finding direct objects and determining the principal one")
clause_objects = collections.defaultdict(set)
objects = collections.defaultdict(set)
objects_count = collections.defaultdict(collections.Counter)
object_kinds = (
    "principal",
    "direct",
    "NP",
    "L",
    "K",
    "clause",
    "infinitive",
)

In [None]:
def is_marked(phr):
    # simple criterion for determining whether a direct object is marked:
    # does it contain the object marker >T somewhere?
    # NB: use the parameter phr, not a global loop variable
    return any(F.lex.v(w) == ">T" for w in L.d(phr, "word"))

In [None]:
for c in clause_verb:
    these_objects = collections.defaultdict(set)
    direct_objects_cat = collections.defaultdict(set)

    for p in L.d(c, "phrase"):
        pf = pf_corr.get(
            p, F.function.v(p)
        )  # NB we take the corrected value for phrase function if there is one
        if pf in objectfuncs:
            direct_objects_cat["p_" + pf].add(p)
            these_objects["direct"].add(p)
        elif pf == "Cmpl":
            pwords = L.d(p, "word")
            w1 = pwords[0]
            w1l = F.lex.v(w1)
            w2l = F.lex.v(pwords[1]) if len(pwords) > 1 else None
            if (
                w1l in cmpl_as_obj_preps
                and F.prs.v(w1) in no_prs
                and not (w1l == "L" and w2l in body_parts)
            ):
                if w1l == "K":
                    these_objects["K"].add(p)
                elif w1l == "L":
                    these_objects["L"].add(p)

    # find clause objects
    for ac in L.d(L.u(c, "sentence")[0], "clause"):
        mothers = list(E.mother.f(ac))
        if not (mothers and mothers[0] == c):
            continue
        cr = F.rela.v(ac)
        ct = F.typ.v(ac)
        if cr in {"Objc"} or ct in {"InfC"}:
            clause_objects[c].add(ac)
            if cr in {"Objc"}:
                label = cr
                direct_objects_cat["c_" + label].add(ac)
                these_objects["direct"].add(ac)
                these_objects["clause"].add(ac)
            elif ct in {"InfC"}:
                if F.lex.v(L.d(ac, "word")[0]) == "L":
                    these_objects["infinitive"].add(ac)
        else:
            continue

    # order the objects in the natural ordering
    direct_objects_order = N.sortNodes(these_objects.get("direct", set()))
    nobjects = len(direct_objects_order)

    # compute the principal object
    principal_object = None

    for x in [1]:
        # just one object
        if nobjects == 1:
            # we have chosen not to mark a principal object if there is only one object
            # the alternative is to mark it if it is a phrase.
            # Uncomment the next 2 lines if you want this
            # theobject = direct_objects_order[0]
            # if F.otype.v(theobject) == 'phrase': principal_object = theobject
            break
        # rule 1: suffixes and promoted objects
        principal_candidates = direct_objects_cat.get(
            "p_PreO", set()
        ) | direct_objects_cat.get("p_PtcO", set())
        if len(principal_candidates) != 0:
            principal_object = N.sortNodes(principal_candidates)[0]
            break
        principal_candidates = direct_objects_cat.get("p_Objc", set())
        if len(principal_candidates) != 0:
            if len(principal_candidates) == 1:
                principal_object = list(principal_candidates)[0]
                break
            objects_marked = set()
            objects_unmarked = set()
            for p in principal_candidates:
                if is_marked(p):
                    objects_marked.add(p)
                else:
                    objects_unmarked.add(p)
            if len(objects_marked) != 0:
                principal_object = N.sortNodes(objects_marked)[0]
                break
            if len(objects_unmarked) != 0:
                principal_object = N.sortNodes(objects_unmarked)[0]
                break
    if principal_object is not None:
        these_objects["principal"].add(principal_object)
    if len(these_objects["infinitive"]) and not len(these_objects["direct"]):
        # we do not mark an infinitive object if there is no proper direct object around
        these_objects["infinitive"] = set()
    if len(these_objects["principal"]):
        these_objects["direct"] -= these_objects["principal"]
        for x in these_objects["direct"] - these_objects["clause"]:
            # the NP objects are the non-principal phrase like direct objects
            these_objects["NP"].add(x)
        these_objects["direct"] -= these_objects["NP"]
    if (
        len(these_objects["principal"]) == 0
        and len(these_objects["direct"])
        and (
            len(these_objects["NP"])
            or len(these_objects["L"])
            or len(these_objects["K"])
            or len(these_objects["infinitive"])
        )
    ):  # promote the direct objects to principal direct objects
        these_objects["principal"] = these_objects["direct"]
        these_objects["direct"] = set()

    for kind in object_kinds:
        n = len(these_objects.get(kind, set()))
        objects_count[kind][n] += 1
        if n:
            objects[kind] |= these_objects[kind]

In [36]:
utils.caption(0, "\tDone")

..............................................................................................
.      6m 22s Generate blank enrichment sheets                                               .
..............................................................................................
|      6m 22s 	as /Users/dirk/github/etcbc/valence/source/2021/enrich_blank/{verb}.csv
|      6m 23s 		for verb <FH (11315 rows)
|      6m 23s 		for verb PQD ( 1283 rows)
|      6m 23s 		for verb BW> (11087 rows)
|      6m 24s 		for verb NF> ( 2886 rows)
|      6m 24s 		for verb NPL ( 1933 rows)
|      6m 24s 		for verb QR> ( 3726 rows)
|      6m 24s 		for verb HLK ( 5814 rows)
|      6m 24s 		for verb SWR ( 1272 rows)
|      6m 24s 		for verb FJM ( 2918 rows)
|      6m 24s 		for verb NWS (  618 rows)
|      6m 24s 		for verb CWB ( 4325 rows)
|      6m 25s 		for verb <LH ( 3893 rows)
|      6m 25s 		for verb NTN ( 9822 rows)
|      6m 25s 		for verb BR> (  219 rows)
|      6m 25s 		for verb CJT (  381 rows

In [24]:
for kind in object_kinds:
    total = 0
    for (count, n) in sorted(objects_count[kind].items(), key=lambda y: -y[0]):
        if count:
            total += n
        utils.caption(
            0,
            "\t{:>5} clauses with {:>2} {:>10} object{}".format(
                n, count, kind, "s" if count != 1 else ""
            ),
        )
    utils.caption(
        0, "\t{:>5} clauses with {:>2} {:>10} object".format(total, "a", kind)
    )

..............................................................................................
.      3m 57s Finding direct objects and determining the principal one                       .
..............................................................................................
|      4m 00s 	Done
|      4m 00s 	 3649 clauses with  1  principal object
|      4m 00s 	66501 clauses with  0  principal objects
|      4m 00s 	 3649 clauses with  a  principal object
|      4m 00s 	23694 clauses with  1     direct object
|      4m 00s 	46456 clauses with  0     direct objects
|      4m 00s 	23694 clauses with  a     direct object
|      4m 00s 	 1001 clauses with  1         NP object
|      4m 00s 	69149 clauses with  0         NP objects
|      4m 00s 	 1001 clauses with  a         NP object
|      4m 00s 	   33 clauses with  2          L objects
|      4m 00s 	 3828 clauses with  1          L object
|      4m 00s 	66289 clauses with  0          L objects
|      4m 00s 	 3861 clauses w

## 2.3 Indirect objects

The BHSA database has no feature that marks indirect objects.
We will use computation to determine whether a complement is an indirect object or a locative.
This computation is just an approximation.

[More on indirect objects](https://github.com/ETCBC/valence/wiki/Discussion#indirect-objects)

### The decision

We decide as follows.
Based on indicators $ind$ and $loc$, which are proxies for the degree to which the complement is an indirect object or a locative respectively, we arrive at a decision $L$ (complement is *locative*), $I$ (complement is *indirect object*) or $C$ (complement is neither *locative* nor *indirect object*):

(1) $ loc > 0 \wedge ind = 0 \Rightarrow L $

(2) $ loc = 0 \wedge ind > 0 \Rightarrow I $

(3) $ loc > 0 \wedge ind > 0 \wedge\ loc - 1 > ind \Rightarrow L$

(4) $ loc > 0 \wedge ind > 0 \wedge\ loc + 1 < ind \Rightarrow I$

(5) $ loc > 0 \wedge ind > 0 \wedge |ind - loc| \leq 1 \Rightarrow C$

(6) $ loc = 0 \wedge ind = 0 \Rightarrow C $
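
The rules above can be sketched as a small standalone function. This is only an illustration: the name `decide_kind` is hypothetical, and the scores `loc` and `ind` are assumed to be non-negative integers. The notebook computes the actual verdict inline, further down.

```python
def decide_kind(loc, ind):
    """Return "L", "I" or "C" for non-negative scores loc and ind."""
    if loc > 0 and ind == 0:
        return "L"  # rule (1)
    if loc == 0 and ind > 0:
        return "I"  # rule (2)
    if loc > 0 and ind > 0:
        if loc - 1 > ind:
            return "L"  # rule (3)
        if loc + 1 < ind:
            return "I"  # rule (4)
    # rule (5), and also the case where both scores are 0
    return "C"


# Try a few score pairs (loc, ind), one for each branch.
for scores in [(2, 0), (0, 1), (3, 1), (1, 3), (2, 1), (0, 0)]:
    print(scores, decide_kind(*scores))
```

Note that a complement only becomes $L$ or $I$ in the mixed case when one score exceeds the other by at least 2; near ties stay undecided as $C$.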

In [None]:
complfuncs = set(
    """
Cmpl PreC
""".strip().split()
)

In [25]:
cmpl_as_iobj_preps = set(
    """
L >L
""".strip().split()
)

In [26]:
locative_lexemes = set(
    """
>RY/ >YL/ >XR/
<BR/ <BRH/ <BWR/ <C==/ <JR/ <L=/ <LJ=/ <LJH/ <LJL/ <MD=/ <MDH/ <MH/ <MQ/ <MQ===/ <QB/
BJN/ BJT/
CM CMJM/ CMC/ C<R/
DRK/
FDH/
HR/
JM/ JRDN/ JRWCLM/ JFR>L/
MDBR/ MW<D/ MWL/ MZBX/ MYRJM/ MQWM/ MR>CWT/ MSB/ MSBH/ MVH==/
QDM/
SBJB/
TJMN/ TXT/ TXWT/
YPWN/
""".strip().split()
)

In [None]:
personal_lexemes = set(
    """
>B/ >CH/ >DM/ >DRGZR/ >DWN/ >JC/ >J=/ >KR/ >LJL/ >LMN=/ >LMNH/ >LMNJ/ >LWH/ >LWP/ >M/ 
>MH/ >MN==/ >MWN=/ >NC/ >NWC/ >PH/ >PRX/ >SJR/ >SJR=/ >SP/ >X/ >XCDRPN/
>XWH/ >XWT/
<BDH=/ <CWQ/ <D=/ <DH=/ <LMH/ <LWMJM/ <M/ <MD/ <MJT/ <QR=/ <R/ <WJL/ <WL/ <WL==/ <WLL/
<WLL=/ <YRH/
B<L/ B<LH/ BKJRH/ BKR/ BN/ BR/ BR===/ BT/ BTWLH/ BWQR/ BXRJM/ BXWN/ BXWR/
CD==/ CDH/ CGL/ CKN/ CLCJM/ CLJC=/ CMRH=/ CPXH/ CW<R/ CWRR/
DJG/ DWD/ DWDH/ DWG/ DWR/
F<JR=/ FB/ FHD/ FR/ FRH/ FRJD/ FVN/
GBJRH/ GBR/ GBR=/ GBRT/ GLB/ GNB/ GR/ GW==/ GWJ/ GZBR/
HDBR/ 
J<RH/ JBM/ JBMH/ JD<NJ/ JDDWT/ JLD/ JLDH/ JLJD/ JRJB/ JSWR/ JTWM/ JWYR/
JYRJM/ 
KCP=/ KHN/ KLH/ KMR/ KN<NJ=/ KNT/ KRM=/ KRWB/ KRWZ/
L>M/ LHQH/ LMD/ LXNH/
M<RMJM/ M>WRH/ MCBR/ MCJX/ MCM<T/ MCMR/ MCPXH/ MCQLT/ MD<=/ MD<T/ MG/
MJNQT/ MKR=/ ML>K/ MLK/ MLKH/ MLKT/ MLX=/ MLYR/ MMZR/ MNZRJM/ MPLYT/ MYRJ/
MPY=/ MQHL/ MQY<H/ MR</ MR>/ MSGR=/ MT/ MWRH/ MYBH=/
N<R/ N<R=/ N<RH/ N<RWT/ N<WRJM/ NBJ>/ NBJ>H/ NCJN/ NFJ>/ NGJD/ NJN/ NKD/ 
NKR/ NPC/ NPJLJM/ NQD/ NSJK/ NTJN/ 
PLGC/ PLJL/ PLJV/ PLJV=/ PQJD/ PR<H/ PRC/ PRJY/ PRJY=/ PRTMJM/ PRZWN/ 
PSJL/ PSL/ PVR/ PVRH/ PXH/ PXR/
QBYH/ QCRJM/ QCT=/ QHL/ QHLH/ QHLT/ QJM/ QYJN/
R<H=/ R<H==/ R<JH/ R<=/ R<WT/ R>H/ RB</ RB=/ RB==/ RBRBNJN/ RGMH/ RHB/ RKB=/
RKJL/ RMH/ RQX==/ 
SBL/ SPR=/ SRJS/ SRK/ SRNJM/ 
T<RWBWT/ TLMJD/ TLT=/ TPTJ/ TR<=/ TRCT>/ TRTN/ TWCB/ TWL<H/ TWLDWT/ TWTX/
VBX/ VBX=/ VBXH=/ VPSR/ VPXJM/
WLD/
XBL==/ XBL======/ XBR/ XBR=/ XBR==/ XBRH/ XBRT=/ XJ=/ XLC/ XM=/ XMWT/
XMWY=/ XNJK/ XR=/ XRC/ XRC====/ XRP=/ XRVM/ XTN/ XTP/ XZH=/
Y<JRH/ Y>Y>JM/ YJ/ YJD==/ YJR==/ YR=/ YRH=/ 
ZKWR/ ZMR=/ ZR</
""".strip().split()  # noqa W291
)

In [None]:
utils.caption(4, "Determining kind of complements")

In [None]:
complements_c = collections.defaultdict(lambda: collections.defaultdict(lambda: []))
complements = {}
complementk = {}
kcomplements = collections.Counter()

In [None]:
nphrases = 0
ncomplements = 0

In [None]:
for c in clause_verb:
    for p in L.d(c, "phrase"):
        nphrases += 1
        pf = pf_corr.get(p, F.function.v(p))
        if pf not in complfuncs:
            continue
        ncomplements += 1
        words = L.d(p, "word")
        lexemes = [F.lex.v(w) for w in words]
        lexeme_set = set(lexemes)

        # measuring locativity
        lex_locativity = len(locative_lexemes & lexeme_set)
        prep_b = len([x for x in lexeme_set if x == "B"])
        topo = len([x for x in words if F.nametype.v(x) == "topo"])
        h_loc = len([x for x in words if F.uvf.v(x) == "H"])
        body_part = 0
        if (
            len(words) > 1
            and F.lex.v(words[0]) == "L"
            and F.lex.v(words[1]) in body_parts
        ):
            body_part = 2
        loca = lex_locativity + topo + prep_b + h_loc + body_part

        # measuring indirect object
        prep_l = len(
            [
                x
                for x in words
                if F.lex.v(x) in cmpl_as_iobj_preps and F.prs.v(x) not in no_prs
            ]
        )
        prep_lpr = 0
        lwn = len(words)
        for (n, wn) in enumerate(words):
            if F.lex.v(wn) in cmpl_as_iobj_preps:
                if n + 1 < lwn:
                    nextw = words[n + 1]
                    if (
                        F.lex.v(nextw) in personal_lexemes
                        or F.ls.v(nextw) == "gntl"
                        or (F.sp.v(nextw) == "nmpr" and F.nametype.v(nextw) == "pers")
                    ):
                        prep_lpr += 1
        indi = prep_l + prep_lpr

        # the verdict
        ckind = "C"
        if loca == 0 and indi > 0:
            ckind = "I"
        elif loca > 0 and indi == 0:
            ckind = "L"
        elif loca > indi + 1:
            ckind = "L"
        elif loca < indi - 1:
            ckind = "I"
        complementk[p] = (loca, indi, ckind)
        kcomplements[ckind] += 1
        complements_c[c][ckind].append(p)
        complements[p] = (pf, ckind)

In [27]:
utils.caption(0, "\tDone")
for (label, n) in sorted(kcomplements.items(), key=lambda y: -y[1]):
    utils.caption(0, "\tPhrases of kind {:<2}: {:>6}".format(label, n))
utils.caption(0, "\tTotal complements : {:>6}".format(ncomplements))
utils.caption(0, "\tTotal phrases     : {:>6}".format(nphrases))

..............................................................................................
.      4m 45s Determining kind of complements                                                .
..............................................................................................
|      4m 46s 	Done
|      4m 46s 	Phrases of kind C :  16806
|      4m 46s 	Phrases of kind L :  12365
|      4m 46s 	Phrases of kind I :   7838
|      4m 46s 	Total complements :  37009
|      4m 46s 	Total phrases     : 214525


In [None]:
def has_L(vl, pn):
    words = L.d(pn, "word")
    return len(words) > 0 and F.lex.v(words[0]) == "L"

In [None]:
def is_lex_personal(vl, pn):
    words = L.d(pn, "word")
    return len(words) > 1 and (
        F.lex.v(words[1]) in personal_lexemes or F.nametype.v(words[1]) == "pers"
    )

In [None]:
def is_lex_local(vl, pn):
    words = L.d(pn, "word")
    return len({F.lex.v(w) for w in words} & locative_lexemes) > 0

In [28]:
def has_H_locale(vl, pn):
    words = L.d(pn, "word")
    return len({w for w in words if F.uvf.v(w) == "H"}) > 0

## 2.4 Generic logic

This is the function that applies the generic rules about (in)direct objects and locatives.
It takes a phrase node and a set of new label values, and modifies those values.
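
As a rough illustration of that contract, the sketch below uses a toy `objects` index and a hypothetical function `relabel`; the node numbers and the starting label `complement` are made up. It shows the pattern the generic rules follow: the `values` dictionary is modified in place, the previous grammatical label is kept under `original` when it is overwritten, and a short rule code is returned for bookkeeping.

```python
# Toy stand-in for the notebook's `objects` index (node numbers are made up).
objects = {"principal": {101}, "L": {102}}


def relabel(pn, values):
    """Apply a simplified subset of the generic rules to phrase node pn."""
    rule = None
    old = values["grammatical"]
    if pn in objects["principal"]:
        if old == "direct_object":
            rule = "pdos"
        else:
            rule = "pdos-x"
            values["original"] = old
        values["grammatical"] = "principal_direct_object"
    elif pn in objects["L"]:
        rule = "ldos"
        values["original"] = old
        values["grammatical"] = "L_object"
    # None means: no generic rule fired, values are untouched
    return rule


values = {"grammatical": "direct_object", "original": ""}
print(relabel(101, values), values)
```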

In [None]:
grule_as_str = {
    "pdos": """direct_object => principal_direct_object""",
    "pdos-x": """non-object => principal_direct_object""",
    "ndos": """direct_object => NP_direct_object""",
    "ndos-x": """non-object => NP_direct_object""",
    "dos": """non-object => direct_object""",
    "ldos": """non-object => L_object""",
    "kdos": """non-object => K_object""",
    "inds-c": """complement => indirect_object""",
    "locs-c": """complement => location""",
    "inds-p": """predicate complement => indirect_object""",
    "locs-p": """predicate complement => location""",
    "cdos": """direct-object =(superfluously)=> direct object (clause)""",
    "cdos-x": """non-object => direct object (clause)""",
    "idos": """infinitive_object =(superfluously)=> infinitive_object (clause)""",
    "idos-x": """infinitive clause => infinitive_object""",
}

In [None]:
def rule_as_str_g(x, i):
    return "{}-{}".format(i, grule_as_str[i])

In [None]:
rule_as_str = dict(
    generic=rule_as_str_g,
)

In [None]:
def generic_logic_p(pn, values):
    gl = None
    if pn in objects["principal"]:
        oldv = values["grammatical"]
        if oldv == "direct_object":
            gl = "pdos"
        else:
            gl = "pdos-x"
            values["original"] = oldv
        values["grammatical"] = "principal_direct_object"
    elif pn in objects["NP"]:
        oldv = values["grammatical"]
        if oldv == "direct_object":
            gl = "ndos"
        else:
            gl = "ndos-x"
            values["original"] = oldv
        values["grammatical"] = "NP_direct_object"
    elif pn in objects["direct"]:
        oldv = values["grammatical"]
        if oldv != "direct_object":
            gl = "dos"
            values["original"] = oldv
            values["grammatical"] = "direct_object"
    elif pn in objects["L"]:
        oldv = values["grammatical"]
        gl = "ldos"
        values["original"] = oldv
        values["grammatical"] = "L_object"
    elif pn in objects["K"]:
        oldv = values["grammatical"]
        gl = "kdos"
        values["original"] = oldv
        values["grammatical"] = "K_object"
    elif pn in complements:
        (pf, ck) = complements[pn]
        if ck in {"I", "L"}:
            if pf == "Cmpl":
                if ck == "I":
                    values["grammatical"] = "indirect_object"
                    gl = "inds-c"
                else:
                    values["lexical"] = "location"
                    values["semantic"] = "location"
                    gl = "locs-c"
            elif pf == "PreC":
                if ck == "I":
                    values["grammatical"] = "indirect_object"
                    gl = "inds-p"
                else:
                    values["lexical"] = "location"
                    values["semantic"] = "location"
                    gl = "locs-p"
    return gl

In [None]:
def generic_logic_c(cn, values):
    gl = None
    if cn in objects["clause"]:
        oldv = values["grammatical"]
        if oldv == "direct_object":
            gl = "cdos"
        else:
            gl = "cdos-x"
            values["original"] = oldv
        values["grammatical"] = "direct_object"
    elif cn in objects["infinitive"]:
        oldv = values["grammatical"]
        if oldv == "infinitive_object":
            gl = "idos"
        else:
            gl = "idos-x"
            values["original"] = oldv
        values["grammatical"] = "infinitive_object"
    return gl

In [29]:
generic_logic = dict(
    phrase=generic_logic_p,
    clause=generic_logic_c,
)

## 2.5 Verb specific rules

The verb-specific enrichment rules are stored in a dictionary, keyed by the verb lexeme.
Each rule is a tuple of items.

All items but the last are assignments: pairs of an enrichment field and the value it should get.
The last item is a tuple of conditions that must all be fulfilled for the rule to apply.

A condition can take the shape of

* a function, taking the verb lexeme and a phrase or clause node as arguments and returning a boolean value
* a string of the form `feature:value`, where `feature` is a BHSA feature for phrases or clauses,
  which is true if and only if that feature has that value for the phrase or clause in question
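
To make the rule format concrete, here is a minimal sketch of how one such rule could be evaluated. The feature table `features`, the helper `starts_with_l`, the function `rule_applies` and the node numbers are all hypothetical stand-ins; in the notebook, feature values come from Text-Fabric and the evaluation happens in `apply_logic`.

```python
# Hypothetical feature values per node, standing in for Text-Fabric lookups.
features = {
    201: {"function": "Adju", "first_lex": "L"},
    202: {"function": "Cmpl", "first_lex": "B"},
}


def starts_with_l(vl, pn):
    """Function condition: gets the verb lexeme and the node, returns a boolean."""
    return features[pn]["first_lex"] == "L"


# A rule: assignments first, the tuple of conditions last.
rule = (
    ("semantic", "benefactive"),
    ("function:Adju", starts_with_l),
)


def rule_applies(rule, vl, pn):
    """True if every condition of the rule holds for node pn."""
    for condition in rule[-1]:
        if isinstance(condition, str):
            # string condition: compare a feature value of the node
            feature, value = condition.split(":")
            if features[pn][feature] != value:
                return False
        elif not condition(vl, pn):
            # function condition
            return False
    return True


for node in (201, 202):
    print(node, rule_applies(rule, "FJM", node))
```

When a rule applies, its assignments (`rule[:-1]` in this sketch) are the field and value pairs to be written into the enrichment values of the node.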

In [30]:
dbl_obj_rules = (
    (
        ("semantic", "benefactive"),
        ("function:Adju", has_L, is_lex_personal),
    ),
    (
        ("lexical", "location"),
        ("function:Cmpl", has_H_locale),
    ),
    (
        ("lexical", "location"),
        ("semantic", "location"),
        ("function:Cmpl", is_lex_local),
    ),
)
enrich_logic = dict(
    phrase={
        "CJT": dbl_obj_rules,
        "FJM": dbl_obj_rules,
    },
    clause={},
)

In [None]:
rule_index = collections.defaultdict(lambda: [])

In [None]:
def rule_as_str_s(vl, i):
    (conditions, sfassignments) = rule_index[vl][i]
    label = "{}-{}\n".format(vl, i + 1)
    rule = "\tIF   {}".format(
        "\n\tAND  ".join(
            "{:<10} = {:<8}".format(*c.split(":"))
            if type(c) is str
            else "{:<15}".format(c.__name__)
            for c in conditions
        )
    )
    ass = []
    for (i, sfa) in enumerate(sfassignments):
        ass.append("\t\t{:<10} => {:<15}\n".format(*sfa))
    return "{}{}\n\tTHEN\n{}".format(label, rule, "".join(ass))

In [None]:
rule_as_str["specific"] = rule_as_str_s

In [None]:
def check_logic():
    utils.caption(4, "Checking enrichment logic")
    errors = 0
    nrules = 0
    for kind in sorted(enrich_logic):
        for vl in sorted(enrich_logic[kind]):
            for items in enrich_logic[kind][vl]:
                rule_index[vl].append((items[-1], items[0:-1]))
            for (i, (conditions, sfassignments)) in enumerate(rule_index[vl]):
                if not SCRIPT:
                    utils.caption(0, rule_as_str_s(vl, i), continuation=True)
                nrules += 1
                for (sf, sfval) in sfassignments:
                    if sf not in enrich_fields:
                        utils.caption(
                            0,
                            'ERROR: {}: "{}" not a valid enrich field'.format(kind, sf),
                            continuation=True,
                        )
                        errors += 1
                    elif sfval not in enrich_fields[sf]:
                        utils.caption(
                            0,
                            'ERROR: {}: `{}`: "{}" not a valid enrich field value'.format(
                                kind, sf, sfval
                            ),
                            continuation=True,
                        )
                        errors += 1
                for c in conditions:
                    if type(c) == str:
                        x = c.split(":")
                        if len(x) != 2:
                            utils.caption(
                                0,
                                "ERROR: {}: Wrong feature condition {}".format(kind, c),
                                continuation=True,
                            )
                            errors += 1
                        else:
                            (feat, val) = x
                            if feat not in legal_values:
                                utils.caption(
                                    0,
                                    "ERROR: {}: Feature `{}` not in use".format(
                                        kind, feat
                                    ),
                                    continuation=True,
                                )
                                errors += 1
                            elif val not in legal_values[feat]:
                                utils.caption(
                                    0,
                                    'ERROR: {}: Feature `{}`: not a valid value "{}"'.format(
                                        kind, feat, val
                                    ),
                                    continuation=True,
                                )
                                errors += 1
    if errors:
        utils.caption(
            0, "\tERROR: There were {} errors in {} rules".format(errors, nrules)
        )
    else:
        utils.caption(0, "\tAll {} rules OK".format(nrules))

In [31]:
check_logic()

..............................................................................................
.      5m 30s Checking enrichment logic                                                      .
..............................................................................................
CJT-1
	IF   function   = Adju    
	AND  has_L          
	AND  is_lex_personal
	THEN
		semantic   => benefactive    

CJT-2
	IF   function   = Cmpl    
	AND  has_H_locale   
	THEN
		lexical    => location       

CJT-3
	IF   function   = Cmpl    
	AND  is_lex_local   
	THEN
		lexical    => location       
		semantic   => location       

FJM-1
	IF   function   = Adju    
	AND  has_L          
	AND  is_lex_personal
	THEN
		semantic   => benefactive    

FJM-2
	IF   function   = Cmpl    
	AND  has_H_locale   
	THEN
		lexical    => location       

FJM-3
	IF   function   = Cmpl    
	AND  is_lex_local   
	THEN
		lexical    => location       
		semantic   => location       

|      5m 30s 	All 6 rules OK


In [None]:
rule_cases = collections.defaultdict(lambda: collections.defaultdict(lambda: {}))

In [32]:
def apply_logic(kind, vl, n, init_values):
    values = deepcopy(init_values)
    gr = generic_logic[kind](n, values)
    if gr:
        rule_cases["generic"][kind].setdefault(("", gr), []).append(n)
    verb_rules = enrich_logic[kind].get(vl, [])
    for (i, items) in enumerate(verb_rules):
        conditions = items[-1]
        sfassignments = items[0:-1]

        ok = True
        for condition in conditions:
            if type(condition) is str:
                (feature, value) = condition.split(":")
                if feature == "function" and kind == "phrase":
                    fval = pf_corr.get(n, F.function.v(n))
                else:
                    fval = F.item[feature].v(n)
                this_ok = fval == value
            else:
                this_ok = condition(vl, n)
            if not this_ok:
                ok = False
                break
        if ok:
            for (sf, sfval) in sfassignments:
                values[sf] = sfval
            rule_cases["specific"][kind].setdefault((vl, i), []).append(n)
    return tuple(values[sf] for sf in enrich_fields)

## 2.6 Generate enrichments

First we generate enriched values for all relevant phrases.
The generated enrichment values are computed on the basis of generic logic.
Additionally, verb-bound logic is applied, if it has been specified.

We store the enriched values in the dictionary `enrichFields`, keyed by the node number of the
constituent (a `phrase` or a `clause`) that receives the enrichments.
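
A minimal sketch of this storage step, under simplifying assumptions: the field list `fields`, the template table `templates`, the function `enrich` and the node number are hypothetical, whereas the notebook uses its own `enrich_fields` and `transform` defined earlier. The point is that each node starts from a copy of the template for its function, rules may overwrite some values, and the result is stored as a tuple in a fixed field order, in a dictionary keyed by node number.

```python
from copy import deepcopy

# Hypothetical, shortened field order and per-function templates.
fields = ("grammatical", "original", "lexical", "semantic")
templates = {
    "Cmpl": {
        "grammatical": "complement",
        "original": "",
        "lexical": "",
        "semantic": "",
    },
}


def enrich(function, overrides):
    """Copy the template for a phrase function, apply overrides, return a tuple."""
    values = deepcopy(templates[function])  # never modify the shared template
    values.update(overrides)
    return tuple(values[f] for f in fields)


enriched = {}
enriched[301] = enrich("Cmpl", {"lexical": "location", "semantic": "location"})
print(enriched[301])
```

Copying the template first matters: without it, a rule fired for one node would leak its values into every later node with the same function.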

In [None]:
utils.caption(4, "Generating enrichments")

In [None]:
seen = collections.defaultdict(collections.Counter)
enrichFields = dict()

In [None]:
def gen_enrich(verb):
    clauses_seen = set()

    for wn in occs[verb]:
        cn = L.u(wn, "clause")[0]
        if cn in clauses_seen:
            continue
        clauses_seen.add(cn)
        vl = F.lex.v(wn).rstrip("[=")
        for pn in L.d(cn, "phrase"):
            seen["phrase"][pn] += 1
            pf = pf_corr.get(pn, F.function.v(pn))
            enrichFields[pn] = apply_logic("phrase", vl, pn, transform["phrase"][pf])
        for scn in clause_objects[cn]:
            seen["clause"][scn] += 1
            scty = F.typ.v(scn)
            scr = F.rela.v(scn)
            enrichFields[scn] = apply_logic(
                "clause", vl, scn, transform["clause"][scr if scr == "Objc" else scty]
            )

In [33]:
for verb in verb_clause_index:
    gen_enrich(verb)
utils.caption(
    0, "\tGenerated enrichment values for {} verbs:".format(len(verb_clause_index))
)
utils.caption(0, "\tEnriched values for {:>5} nodes".format(len(enrichFields)))

..............................................................................................
.      5m 52s Generating enrichments                                                         .
..............................................................................................
|      5m 56s 	Generated enrichment values for 1380 verbs:
|      5m 56s 	Enriched values for 221353 nodes


In [None]:
utils.caption(0, "\tOverview of rule applications:")

In [None]:
for scope in rule_cases:
    totalscope = 0
    for kind in rule_cases[scope]:
        utils.caption(0, "{}-{} rules:".format(scope, kind))
        totalkind = 0
        for rule_spec in rule_cases[scope][kind]:
            cases = rule_cases[scope][kind][rule_spec]
            n = len(cases)
            totalscope += n
            totalkind += n
            if not SCRIPT:
                # generic and specific rules are reported in the same layout
                utils.caption(
                    0,
                    "{:>4} x\n\t{}\n\t{}\n".format(
                        n,
                        rule_as_str[scope](*rule_spec),
                        ", ".join(str(c) for c in cases[0:10]),
                    ),
                )
        utils.caption(0, "{:>6} {}-{} rule applications".format(totalkind, scope, kind))
    utils.caption(0, "{:>6} {} rule applications".format(totalscope, scope))

In [34]:
for kind in seen:
    stats = collections.Counter()
    for (node, times) in seen[kind].items():
        stats[times] += 1
    if not SCRIPT:
        for (times, n) in sorted(stats.items(), key=lambda y: (-y[1], y[0])):
            utils.caption(0, "\t{:>6} {} seen {:<2} time(s)".format(n, kind, times))
    utils.caption(0, "\t{:>6} {} seen in total".format(len(seen[kind]), kind))

|      6m 02s 	Overview of rule applications:
|      6m 02s generic-phrase rules:
|      6m 02s 1098 x
	ndos-direct_object => NP_direct_object
	651871, 652852, 784482, 788915, 887094, 668755, 677667, 680588, 696396, 700200

|      6m 02s 3871 x
	pdos-direct_object => principal_direct_object
	651873, 652853, 784483, 786606, 788914, 744784, 746449, 771047, 876074, 887093

|      6m 02s  625 x
	locs-p-predicate complement => location
	783707, 651621, 651984, 652623, 653139, 653425, 653438, 653473, 653798, 653929

|      6m 02s 11750 x
	locs-c-complement => location
	813424, 651625, 652673, 652692, 653902, 654097, 654101, 654150, 654153, 654156

|      6m 02s 4186 x
	ldos-non-object => L_object
	651731, 651734, 651914, 654024, 655756, 656050, 658720, 659683, 660594, 663717

|      6m 02s 6170 x
	inds-c-complement => indirect_object
	651912, 653331, 653809, 654033, 654044, 654255, 654261, 655715, 657058, 657490

|      6m 02s  434 x
	inds-p-predicate complement => indirect_object
	652018, 6

For selected verbs, we write the enrichments to spreadsheets.
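
A minimal sketch of writing such a sheet with Python's standard `csv` module, using `;` as the delimiter, as the notebook does. The field names, the rows and the function `write_sheet` are made up for illustration; the notebook itself builds each line with `str.join`. The `csv` writer additionally quotes any cell that happens to contain the delimiter.

```python
import csv
import io


def write_sheet(handle, field_names, rows, delimiter=";"):
    """Write a header line and the rows as delimiter-separated text."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow(field_names)
    writer.writerows(rows)


# Made-up example data; an in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_sheet(
    buffer,
    ["cnode#", "book", "text"],
    [(1, "Genesis", "plain text"), (2, "Genesis", "with; semicolon")],
)
print(buffer.getvalue())
```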

In [None]:
COMMON_FIELDS = """
    cnode#
    vnode#
    onode#
    book
    chapter
    verse
    verb_lexeme
    verb_stem
    verb_occurrence
    text
    constituent
""".strip().split()

In [None]:
PHRASE_FIELDS = """
    type
    function
""".strip().split()

In [None]:
CLAUSE_FIELDS = """
    type
    rela
""".strip().split()

In [35]:
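field_names = COMMON_FIELDS + CLAUSE_FIELDS + PHRASE_FIELDS + list(enrich_fields)
# number of empty cells that pad a phrase row (it leaves the clause columns blank)
pfillrows = len(CLAUSE_FIELDS)
# number of empty cells that pad a clause row (it leaves the phrase columns blank)
cfillrows = len(PHRASE_FIELDS)
# number of empty cells after the common fields in the row for the verb's clause itself
fillrows = pfillrows + cfillrows + len(enrich_fields)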
if not SCRIPT:
    print("\n".join(field_names))

cnode#
vnode#
onode#
book
chapter
verse
verb_lexeme
verb_stem
verb_occurrence
text
constituent
type
rela
type
function
valence
predication
grammatical
original
lexical
semantic


In [None]:
utils.caption(4, "Generate blank enrichment sheets")
sheetKind = "enrich_blank"
utils.caption(0, "\tas {}".format(vfile("{verb}", sheetKind)[1]))

In [None]:
def gen_sheet_enrich(verb):
    rows = []
    fieldsep = ";"
    clauses_seen = set()
    for wn in occs[verb]:
        cn = L.u(wn, "clause")[0]
        if cn in clauses_seen:
            continue
        clauses_seen.add(cn)
        (book_name, chapter, verse) = T.sectionFromNode(cn, lang="la")
        book = T.sectionFromNode(cn)[0]
        vl = F.lex.v(wn).rstrip("[=")
        vstem = F.vs.v(wn)
        vt = T.text([wn], fmt="text-trans-plain")
        ct = T.text(L.d(cn, "word"), fmt="text-trans-plain")

        common_fields = (cn, wn, -1, book, chapter, verse, vl, vstem, vt, ct, "")
        rows.append(common_fields + (("",) * fillrows))
        for pn in L.d(cn, "phrase"):
            seen["phrase"][pn] += 1
            pt = T.text(L.d(pn, "word"), fmt="text-trans-plain")
            common_fields = (
                cn,
                wn,
                pn,
                book,
                chapter,
                verse,
                vl,
                vstem,
                "",
                pt,
                "phrase",
            )
            pty = F.typ.v(pn)
            pf = pf_corr.get(pn, F.function.v(pn))
            phrase_fields = ("",) * pfillrows + (pty, pf) + enrichFields[pn]
            rows.append(common_fields + phrase_fields)
        for scn in clause_objects[cn]:
            seen["clause"][scn] += 1
            sct = T.text(L.d(scn, "word"), fmt="text-trans-plain")
            common_fields = (
                cn,
                wn,
                scn,
                book,
                chapter,
                verse,
                vl,
                vstem,
                "",
                sct,
                "clause",
            )
            scty = F.typ.v(scn)
            scr = F.rela.v(scn)
            clause_fields = (scty, scr) + ("",) * cfillrows + enrichFields[scn]
            rows.append(common_fields + clause_fields)

    location = vfile(verb, sheetKind)
    if location is None:
        return
    (baseName, fileName) = location

    with open(fileName, "w") as row_file:
        row_file.write("{}\n".format(fieldsep.join(field_names)))
        for row in rows:
            row_file.write("{}\n".format(fieldsep.join(str(x) for x in row)))
    utils.caption(0, "\t\tfor verb {} ({:>5} rows)".format(verb, len(rows)))

In [None]:
for verb in verbs:
    gen_sheet_enrich(verb)

In [None]:
utils.caption(0, "\tDone")

In [37]:
def showcase(n):
    otype = F.otype.v(n)
    att1 = pf_corr.get(n, F.function.v(n)) if otype == "phrase" else F.rela.v(n)
    att2 = F.typ.v(n)
    utils.caption(
        0,
        """{} ({}-{}) {}\n{} {}:{}    {}\n""".format(
            otype,
            att1,
            att2,
            T.text(L.d(n, "word"), fmt="text-trans-plain"),
            *T.sectionFromNode(n),
            T.text(L.d(L.u(n, "verse")[0], "word"), fmt="text-trans-plain"),
        ),
        continuation=True,
    )

In [38]:
if not SCRIPT:
    showcase(654844)
    showcase(445014)
    # showcase(426954)

phrase (Pred-VP) JKLW 
Genesis 13:6    WL>&NF> >TM H>RY LCBT JXDW KJ&HJH RKWCM RB WL> JKLW LCBT JXDW00 

clause (Coor-WxY0) WLW&>TN >T&H>RY WLBNJW 
Deuteronomium 1:36    ZWLTJ KLB BN&JPNH HW> JR>NH WLW&>TN >T&H>RY >CR DRK&BH WLBNJW J<N >CR ML> >XRJ JHWH00 



In [None]:
def check_h(vl, show_results=False):
    hl = {}
    total = 0
    for w in F.otype.s("word"):
        if F.sp.v(w) != "verb" or F.lex.v(w).rstrip("[=/") != vl:
            continue
        total += 1
        c = L.u(w, "clause")[0]
        ps = L.d(c, "phrase")
        # phrases that contain a word with a he locale
        phs = {
            p for p in ps if any(F.uvf.v(pw) == "H" for pw in L.d(p, "word"))
        }
        for f in ("Cmpl", "Adju", "Loca"):
            # use the corrected phrase function if there is one, else the original
            phc = {p for p in ps if pf_corr.get(p, F.function.v(p)) == f}
            if len(phc & phs):
                hl.setdefault(f, set()).add(w)
    for f in hl:
        utils.caption(
            0,
            "Verb {}: {} occurrences. He locales in {} phrases: {}".format(
                vl, total, f, len(hl[f])
            ),
            continuation=True,
        )
        if show_results:
            utils.caption(
                0, "\t{}".format(", ".join(str(x) for x in hl[f])), continuation=True
            )

In [39]:
if not SCRIPT:
    check_h("BW>", show_results=True)

Verb BW>: 2570 occurrences. He locales in Cmpl phrases: 164
	257028, 184327, 26120, 256523, 26129, 146458, 187932, 197150, 95265, 272418, 24618, 184362, 398381, 289838, 201265, 136756, 78903, 100418, 5699, 32835, 100421, 401474, 198220, 200270, 24655, 100946, 24660, 112215, 141913, 186972, 196702, 28770, 34406, 298606, 248943, 132209, 162929, 12403, 5748, 146055, 90249, 153227, 396940, 97427, 134803, 151187, 188054, 257177, 21658, 426137, 136349, 162981, 24742, 200361, 214699, 25779, 257204, 158389, 4790, 100535, 4794, 160445, 214719, 90818, 272581, 139974, 249032, 38601, 113869, 8921, 138459, 19168, 20705, 26852, 282853, 8425, 8938, 43241, 145138, 170740, 397045, 254715, 154365, 79615, 200960, 27393, 176387, 165637, 206087, 208648, 426247, 269581, 26382, 106254, 157458, 257305, 149796, 170793, 211244, 26416, 27440, 126769, 9526, 246074, 109371, 172351, 249152, 26433, 4930, 16706, 64834, 398147, 168782, 154975, 132966, 47466, 393583, 47472, 157552, 37238, 100214, 23417, 269182, 23935, 

It would be handy to generate an informational spreadsheet that shows all these cases.
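
A minimal sketch of such a sheet, assuming `check_h` were changed to return its `hl` dictionary (phrase function to set of word nodes) instead of only printing it. The helper name `write_overview`, the column names and the `describe` callback are illustrative and not part of this notebook.

```python
import csv
import io


def write_overview(cases, out, describe=str):
    """Write one row per (phrase function, word node) pair.

    cases:    {phrase function: iterable of word nodes}, as collected in check_h
    out:      an open text file (or any object with a write method)
    describe: hypothetical callback that turns a node into readable info,
              e.g. a passage reference plus clause text
    """
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(("function", "node", "info"))
    n_rows = 0
    for function in sorted(cases):
        for node in sorted(cases[function]):
            writer.writerow((function, node, describe(node)))
            n_rows += 1
    return n_rows


# Illustration with two of the node numbers printed above
buffer = io.StringIO()
write_overview({"Cmpl": {26120, 5699}}, buffer)
print(buffer.getvalue(), end="")
```

In the notebook, `describe` could combine `T.sectionFromNode` and `T.text`, in the same way as `showcase` does.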

## 2.6 Process the filled-in enrichments

We read the enrichments and perform some consistency checks.
If the filled-in sheet does not exist, we take the blank sheet, with the default assignment of the new features.
If a phrase gets conflicting features because it occurs in the sheets of multiple verbs, the values in the filled-in sheet take precedence over the values in the blank sheet. If the conflicting values both occur in filled-in sheets, a warning is issued.
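
The per-field precedence rule, as spelled out in the comments inside `read_enrich()`, can be sketched in isolation. This is a simplified illustration, not the notebook's own code: the function name `merge_enrichment` and the sample values are made up for the example.

```python
def merge_enrichment(blank, filled=None):
    """Combine the values of one node from the blank and the filled-in sheet."""
    # No filled-in row, or an identical one: nothing was changed manually
    if filled is None or tuple(filled) == tuple(blank):
        return tuple(blank)
    merged = []
    for (blank_value, filled_value) in zip(blank, filled):
        if filled_value == "":
            # empty field in the filled-in sheet: the generated value comes through
            merged.append(blank_value)
        elif filled_value == "X":
            # X is the explicit marker for removing a generated value
            merged.append("")
        else:
            # a real manual value wins
            merged.append(filled_value)
    return tuple(merged)


# Hypothetical values, chosen only to show the three cases
blank_row = ("generated-1", "generated-2", "generated-3")
filled_row = ("", "X", "manual-3")
print(merge_enrichment(blank_row, filled_row))
```

The first field keeps its generated value, the second is cleared by the `X`, and the third takes the manual value.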

In [43]:
def read_enrich():
    of_enriched = {
        False: {},  # for enrichments found in blank sheets
        True: {},  # for enrichments found in filled sheets
    }
    repeated = {
        False: collections.defaultdict(list),  # for blank sheets
        True: collections.defaultdict(list),  # for filled sheets
    }
    wrong_value = {
        False: collections.defaultdict(list),
        True: collections.defaultdict(list),
    }

    non_match = collections.defaultdict(list)
    wrong_node = collections.defaultdict(list)

    results = []
    dev_results = []  # results that deviate from the filled sheet

    ERR_LIMIT = 10

    ev = "text-trans-plain"

    for verb in sorted(verbs):
        vresults = {
            False: {},  # for blank sheets
            True: {},  # for filled sheets
        }
        for check in (
            (False, "blank"),
            (True, "filled"),
        ):
            is_filled = check[0]

            location = vfile(verb, "enrich_{}".format(check[1]))
            if location is None:
                continue
            (baseName, fileName) = location

            if not os.path.exists(fileName):
                if not is_filled:
                    utils.caption(
                        0, "\tNO {} enrichment sheet for {}".format(check[1], baseName)
                    )
                continue
            utils.caption(0, "\t{} enrichment sheet for {}".format(check[1], baseName))

            with open(fileName) as fh:
                next(fh)  # skip the header row
                for line in fh:
                    fields = line.rstrip().split(";")
                    on = int(fields[2])
                    if on < 0:
                        continue
                    kind = fields[10]
                    objects_seen[kind][on] += 1
                    vvals = tuple(fields[-nef:])
                    for (f, v) in zip(enrich_fields, vvals):
                        if (
                            v != ""
                            and v != "X"
                            and v != "NA"
                            and v not in enrich_fields[f]
                        ):
                            wrong_value[is_filled][on].append((verb, f, v))
                    vresults[is_filled][on] = vvals
                    if on in of_enriched[is_filled]:
                        if on not in repeated[is_filled]:
                            repeated[is_filled][on] = [of_enriched[is_filled][on]]
                        repeated[is_filled][on].append((verb, vvals))
                    else:
                        of_enriched[is_filled][on] = (verb, vvals)
                    if F.otype.v(on) != kind:
                        non_match[on].append((verb, kind))
            for on in sorted(
                vresults[True]
            ):  # check whether the phrase ids are not mangled
                if on not in vresults[False]:
                    wrong_node[on].append(verb)
            for on in sorted(
                vresults[False]
            ):  # now collect all results, give precedence to filled values
                if F.otype.v(on) == "phrase":
                    f_corr = on in pf_corr  # manual correction in phrase function
                    f_good = pf_corr.get(on, F.function.v(on))
                else:
                    f_corr = ""
                    f_good = ""
                s_manual = (
                    on in vresults[True] and vresults[False][on] != vresults[True][on]
                )  # real change

                # Here we determine which value is going to be put in a feature.
                # Basic rule: if there is a filled-in sheet, take the value from there,
                # else from the blank one.
                # Exception: if a value is empty in the filled-in sheet, but not in the
                # blank one, take the non-empty one.
                #
                # Why? Sometimes we improve the enrich logic, while there may be
                # filled-in sheets based on older blank sheets.
                # We want new values in the blank sheets to come through wherever the
                # filled-in sheet has left the field empty.
                # To remove a value from the blank sheet intentionally,
                # put an X in the corresponding field of the filled-in sheet.
                blank_results = vresults[False][on]
                these_results = []

                for (i, br) in enumerate(blank_results):
                    the_value = br
                    if s_manual and vresults[True][on][i] != "":
                        the_value = vresults[True][on][i]
                        if the_value == "X":
                            the_value = ""
                    these_results.append(the_value)
                these_results = tuple(these_results)

                # these_results = vresults[True][on] if s_manual else vresults[False][on]

                if f_corr or s_manual:
                    dev_results.append(
                        (on,) + these_results + (f_good, f_corr, s_manual)
                    )
                results.append((on,) + these_results + (f_good, f_corr, s_manual))

    for check in (
        (False, "blank"),
        (True, "filled"),
    ):
        if len(wrong_value[check[0]]):  # illegal values in sheets
            wrongs = wrong_value[check[0]]
            for x in sorted(wrongs)[0:ERR_LIMIT]:
                px = T.text(L.d(x, "word"), fmt=ev)
                ref_node = L.u(x, "clause")[0] if F.otype.v(x) != "clause" else x
                cx = T.text(L.d(ref_node, "word"), fmt=ev)
                passage = T.sectionFromNode(x)
                utils.caption(
                    0,
                    "ERROR: {} Illegal value(s) in {}: {} = {} in {}:".format(
                        passage, check[1], x, px, cx
                    ),
                    continuation=True,
                )
                for (verb, f, v) in wrongs[x]:
                    utils.caption(
                        0,
                        'ERROR: \t"{}" is an illegal value for "{}" in verb {}'.format(
                            v,
                            f,
                            verb,
                        ),
                        continuation=True,
                    )
            ne = len(wrongs)
            if ne > ERR_LIMIT:
                utils.caption(
                    0,
                    " ... AND {} CASES MORE".format(ne - ERR_LIMIT),
                    continuation=True,
                )
        else:
            utils.caption(
                0,
                "\tOK: The used {} enrichment sheets have legal values".format(
                    check[1]
                ),
            )

        nerrors = 0
        if len(repeated[check[0]]):  # duplicates in sheets, check consistency
            repeats = repeated[check[0]]
            for x in sorted(repeats):
                overview = collections.defaultdict(list)
                for y in repeats[x]:
                    overview[y[1]].append(y[0])
                px = T.text(L.d(x, "word"), fmt=ev)
                ref_node = L.u(x, "clause")[0] if F.otype.v(x) != "clause" else x
                cx = T.text(L.d(ref_node, "word"), fmt=ev)
                passage = T.sectionFromNode(x)
                if len(overview) > 1:
                    nerrors += 1
                    if nerrors <= ERR_LIMIT:
                        utils.caption(
                            0,
                            "ERROR: {} Conflict in {}: {} = {} in {}:".format(
                                passage, check[1], x, px, cx
                            ),
                            continuation=True,
                        )
                        for vals in overview:
                            utils.caption(
                                0,
                                "\t{:<40} in verb(s) {}".format(
                                    ", ".join(vals),
                                    ", ".join(overview[vals]),
                                ),
                                continuation=True,
                            )
                elif False:  # for debugging purposes
                    # else:
                    nerrors += 1
                    if nerrors < ERR_LIMIT:
                        utils.caption(
                            0,
                            "\t{} Agreement in {} {} = {} in {}: {}".format(
                                passage,
                                check[1],
                                x,
                                px,
                                cx,
                                ",".join(list(overview.values())[0]),
                            ),
                            continuation=True,
                        )
            ne = nerrors
            if ne > ERR_LIMIT:
                utils.caption(
                    0,
                    " ... AND {} CASES MORE".format(ne - ERR_LIMIT),
                    continuation=True,
                )
        if nerrors == 0:
            utils.caption(
                0, "\tOK: The used {} enrichment sheets are consistent".format(check[1])
            )

    if len(non_match):
        utils.caption(
            0, "ERROR: Enrichments have been applied to nodes with non-matching types:"
        )
        for x in sorted(non_match)[0:ERR_LIMIT]:
            # a node may be listed under several verbs: report each mismatch
            for (verb, shouldbe) in non_match[x]:
                utils.caption(
                    0,
                    "ERROR: {}: {} Node {} is not a {} but a {}".format(
                        verb,
                        T.sectionFromNode(x),
                        x,
                        shouldbe,
                        F.otype.v(x),
                    ),
                    continuation=True,
                )
        ne = len(non_match)
        if ne > ERR_LIMIT:
            utils.caption(
                0, " ... AND {} CASES MORE".format(ne - ERR_LIMIT), continuation=True
            )
    else:
        utils.caption(0, "\tOK: all enriched nodes have the type stated in the sheet")

    if len(wrong_node):
        utils.caption(0, "ERROR: Node in filled sheet did not occur in blank sheet:")
        for x in sorted(wrong_node)[0:ERR_LIMIT]:
            px = T.text(L.d(x, "word"), fmt=ev)
            utils.caption(
                0,
                "{}: {} node {}".format(
                    wrong_node[x],
                    T.sectionFromNode(x),
                    x,
                ),
                continuation=True,
            )
        ne = len(wrong_node)
        if ne > ERR_LIMIT:
            utils.caption(
                0, " ... AND {} CASES MORE".format(ne - ERR_LIMIT), continuation=True
            )
    else:
        utils.caption(0, "\tOK: all enriched nodes occurred in the blank sheet")

    if len(dev_results):
        utils.caption(
            0,
            "\tOK: there are {} manual correction/enrichment annotations".format(
                len(dev_results)
            ),
        )
        for r in dev_results[0:ERR_LIMIT]:
            (x, *vals, f_good, f_corr, s_manual) = r
            px = T.text(L.d(x, "word"), fmt=ev)
            # a clause node is its own reference clause
            ref_node = L.u(x, "clause")[0] if F.otype.v(x) != "clause" else x
            cx = T.text(L.d(ref_node, "word"), fmt=ev)
            utils.caption(
                0,
                "{:<30} {:>7} => {:<3} {:<3} {}\n\t{}\n\t\t{}".format(
                    "{} {}:{}".format(*T.sectionFromNode(x)),
                    x,
                    "COR" if f_corr else "",
                    "MAN" if s_manual else "",
                    ",".join(vals),
                    px,
                    cx,
                ),
                continuation=True,
            )
        ne = len(dev_results)
        if ne > ERR_LIMIT:
            utils.caption(
                0,
                "... AND {} ANNOTATIONS MORE".format(ne - ERR_LIMIT),
                continuation=True,
            )
    else:
        utils.caption(0, "\tthere are no manual correction/enrichment annotations")
    return results

In [None]:
utils.caption(4, "Processing enrichment sheets ...")
sheetKind = "enrich_filled"

In [44]:
utils.caption(0, "\tas {}".format(vfile("{verb}", sheetKind)[1]))
objects_seen = collections.defaultdict(collections.Counter)
sheetResults = read_enrich()

..............................................................................................
.     16m 49s Processing enrichment sheets ...                                               .
..............................................................................................
|     16m 49s 	as /Users/dirk/github/etcbc/valence/source/2021/enrich_filled/{verb}.csv
|     16m 49s 	blank enrichment sheet for oBR
|     16m 49s 	blank enrichment sheet for oFH
|     16m 49s 	blank enrichment sheet for oLH
|     16m 49s 	blank enrichment sheet for BRa
|     16m 49s 	blank enrichment sheet for BWa
|     16m 49s 	blank enrichment sheet for CJT
|     16m 49s 	blank enrichment sheet for CWB
|     16m 49s 	blank enrichment sheet for FJM
|     16m 49s 	blank enrichment sheet for HLK
|     16m 49s 	blank enrichment sheet for JRD
|     16m 49s 	blank enrichment sheet for JYa
|     16m 49s 	blank enrichment sheet for NFa
|     16m 49s 	blank enrichment sheet for NPL
|     16m 49s 	blank enrichme

In [45]:
if not SCRIPT:
    # an expression inside an if block is not echoed by the notebook, so print it
    print(list(enrichFields.items())[0:10])

In [46]:
if not SCRIPT:
    # an expression inside an if block is not echoed by the notebook, so print it
    print(sheetResults[0:10])

Combine the sheet results with the generic results in one single dictionary, keyed by node number.

In [47]:
utils.caption(4, "Combine the manual results with the generic results")
allResults = dict()
for (n, *features) in sheetResults:
    allResults[n] = features
utils.caption(0, "\tAnnotations from sheets for {} nodes".format(len(allResults)))
utils.caption(
    0, "\tMerging {} annotations from generic enrichment".format(len(enrichFields))
)
for (n, features) in enrichFields.items():
    if n in allResults:
        continue
    allResults[n] = features + ("", "", False)
utils.caption(0, "\tResulting in annotations for {} nodes".format(len(allResults)))

..............................................................................................
.     17m 22s Combine the manual results with the generic results                            .
..............................................................................................
|     17m 22s 	Annotations from sheets for 53743 nodes
|     17m 22s 	Merging 221353 annotations from generic enrichment
|     17m 22s 	Resulting in annotations for 221353 nodes


# 3 Generate data

We write the correction and enrichment data as a data module in text-fabric format.

In [None]:
newFeatures = list(enrich_fields.keys()) + ["function", "f_correction", "s_manual"]

In [None]:
description = dict(
    title="Correction and enrichment features",
    description="Corrections, alternatives and additions to the BHSA encoding of the Hebrew Bible",
    purpose="Support the decision process of assigning valence to verbs",
    method="Generated blank correction and enrichment spreadsheets with selected clauses",
    steps="sheets filled out by researcher; read back in by program; generated new features based on contents",
    author="The content and nature of the features are by Janet Dyk, the workflow is by Dirk Roorda",
    coreData="BHSA",
    coreVersion=VERSION,
)

In [None]:
metaData = {
    "": description,
    "valence": {
        "description": "verbal valence main classification",
    },
    "predication": {
        "description": "verbal function main classification",
    },
    "grammatical": {
        "description": "constituent role main classification",
    },
    "original": {
        "description": "default value before enrichment logic has been applied",
    },
    "lexical": {
        "description": "additional lexical characteristics",
    },
    "semantic": {
        "description": "additional semantic characteristics",
    },
    "f_correction": {
        "description": "whether the phrase function has been manually corrected",
    },
    "s_manual": {
        "description": "whether the generated enrichment features have been manually changed",
    },
    "function": {
        "description": "corrected phrase function, only present for phrases that were in a correction sheet",
    },
}

In [48]:
for f in newFeatures:
    metaData[f]["valueType"] = "str"

In [None]:
nodeFeatures = dict()

In [None]:
for (node, featureVals) in allResults.items():
    for (fName, fVal) in zip(newFeatures, featureVals):
        fValRep = fVal
        if isinstance(fVal, bool):
            fValRep = "y" if fVal else ""
        nodeFeatures.setdefault(fName, {})[node] = fValRep

In [49]:
RENAMES = [("function", "cfunction")]
for (oldF, newF) in RENAMES:
    for data in (nodeFeatures, metaData):
        data[newF] = data[oldF]
        del data[oldF]

In [50]:
utils.caption(4, "Writing TF enrichment features")
TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)

..............................................................................................
.     17m 42s Writing TF enrichment features                                                 .
..............................................................................................


True

# Diffs

Check differences with previous versions.
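
The report of this step distinguishes features to add, features to delete and features in common. A minimal sketch of that three-way comparison on feature names follows. It is not the implementation of `utils.checkDiffs`, and the function name `compare_feature_sets` is hypothetical.

```python
def compare_feature_sets(new_features, existing_features):
    """Classify feature names by where they occur."""
    new_features = set(new_features)
    existing_features = set(existing_features)
    return {
        "add": sorted(new_features - existing_features),
        "delete": sorted(existing_features - new_features),
        "common": sorted(new_features & existing_features),
    }


# Illustration: two freshly generated features, none delivered before
report = compare_feature_sets({"valence", "cfunction"}, set())
for (kind, names) in report.items():
    print(kind, names)
```

For the features in common, a real check would also have to compare the contents of the feature files, which this sketch leaves out.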

In [51]:
utils.checkDiffs(thisTempTf, thisTf, only=set(nodeFeatures))

..............................................................................................
.     17m 47s Check differences with previous version                                        .
..............................................................................................
|     17m 47s 	9 features to add
|     17m 47s 		cfunction
|     17m 47s 		f_correction
|     17m 47s 		grammatical
|     17m 47s 		lexical
|     17m 47s 		original
|     17m 47s 		predication
|     17m 47s 		s_manual
|     17m 47s 		semantic
|     17m 47s 		valence
|     17m 47s 	no features to delete
|     17m 47s 	0 features in common
|     17m 47s Done


# Deliver

Copy the new TF features from the temporary location where they have been created to their final destination.
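
Text-Fabric stores every feature in its own `<name>.tf` file, so delivery boils down to copying files. The sketch below shows the idea under that assumption only. It is not the code of `utils.deliverFeatures`, and the function name `deliver_features` is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def deliver_features(src_dir, dst_dir, feature_names):
    """Copy `<feature>.tf` files from src_dir to dst_dir; return the names copied."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    delivered = []
    for name in feature_names:
        src_file = src / f"{name}.tf"
        if not src_file.is_file():
            continue  # nothing was generated for this feature
        shutil.copy2(src_file, dst / src_file.name)
        delivered.append(name)
    return delivered


# Demonstration in a throw-away directory (no real data is touched)
with tempfile.TemporaryDirectory() as tmp:
    temp_tf = Path(tmp) / "_temp"
    temp_tf.mkdir()
    (temp_tf / "valence.tf").write_text("@node\n")
    print(deliver_features(temp_tf, Path(tmp) / "tf", ["valence", "cfunction"]))
```

Only `valence` is reported in the demonstration, because no `cfunction.tf` was created in the temporary source directory.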

In [52]:
utils.deliverFeatures(thisTempTf, thisTf, nodeFeatures)

..............................................................................................
.     17m 52s Deliver features to /Users/dirk/github/etcbc/valence/tf/2021                   .
..............................................................................................
|     17m 52s 	valence
|     17m 52s 	predication
|     17m 52s 	grammatical
|     17m 52s 	original
|     17m 52s 	lexical
|     17m 52s 	semantic
|     17m 52s 	f_correction
|     17m 52s 	s_manual
|     17m 52s 	cfunction


# Compile TF

In [None]:
utils.caption(4, "Load and compile the new TF features")

In [53]:
TF = Fabric(locations=[coreTf, thisTf], modules=[""])
api = TF.load(
    """
    lex gloss lex_utf8
    sp vs rela typ
    function
"""
    + " ".join(nodeFeatures)
)
api.makeAvailableIn(globals())

..............................................................................................
.     18m 02s Load and compile the new TF features                                           .
..............................................................................................
This is Text-Fabric 8.5.13
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

97 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.54s T valence              from ~/github/etcbc/valence/tf/2021
   |     0.57s T predication          from ~/github/etcbc/valence/tf/2021
   |     0.49s T grammatical          from ~/github/etcbc/valence/tf/2021
   |     0.32s T original             from ~/github/etcbc/valence/tf/2021
   |     0.41s T lexical              from ~/github/etcbc/valence/tf/2021
   |     0.36s T semantic             from ~/github/etcbc/valence/tf/2021
   |     0.

[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

# Examples
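Take the first 10 phrases and retrieve the original (`function`) and the corrected (`cfunction`) function feature.
The last column shows whether the clause of the phrase occurs in `clause_verb`.
Note that the corrected function feature is only filled in for phrases in a clause that contains one of the selected verbs.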

In [54]:
for i in list(F.otype.s("phrase"))[0:10]:
    print(
        "{} - {} - {}".format(
            F.function.v(i),
            F.cfunction.v(i),
            L.u(i, "clause")[0] in clause_verb,
        )
    )

Time - Time - True
Pred - Pred - True
Subj - Subj - True
Objc - Objc - True
Conj -  - True
Subj -  - True
Pred -  - True
PreC -  - True
Conj - None - False
Subj - None - False


In [55]:
if SCRIPT:
    stop(good=True)

## Results

We put all corrections and enrichments in a single CSV file for checking.

We also generate a smaller CSV, with only the data for selected verbs in it.
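
When rows are written through a format template, it is worth knowing that `str.format` silently ignores surplus positional arguments: a template with too few placeholders drops trailing columns without raising an error. The sketch below shows a stricter alternative that checks the field count for every row. The names `make_row` and `FIELD_SEP` are illustrative and do not occur in the notebook.

```python
FIELD_SEP = ";"


def make_row(values, n_fields):
    """Join values into one CSV line, insisting on exactly n_fields values."""
    values = list(values)
    if len(values) != n_fields:
        raise ValueError(
            "expected {} fields, got {}".format(n_fields, len(values))
        )
    return FIELD_SEP.join(str(v) for v in values) + "\n"


# A template with two placeholders quietly drops the third value ...
print("{};{}\n".format("a", "b", "c"), end="")
# ... whereas make_row keeps all values, or fails loudly on a mismatch
print(make_row(["a", "b", "c"], 3), end="")
```

With such a helper, a mismatch between the number of header columns and the number of data columns shows up as soon as the first row is written.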

In [None]:
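# NB: the combine step above rebinds `allResults` to a dict;
# it must hold the path of the output CSV again before these files are opened
f = open(allResults, "w")
g = open(selectedResults, "w")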

In [None]:
NALLFIELDS = 18  # the header rows and all data rows below pass 18 values
tpl = ("{};" * (NALLFIELDS - 1)) + "{}\n"

In [None]:
utils.caption(0, "collecting constituents ...")
f.write(
    tpl.format(
        "-",
        "-",
        "passage",
        "verb(s) text",
        "-",
        "-",
        "-",
        "-",
        "-",
        "-",
        "-",
        "-",
        "-",
        "-",
        "-",
        "-",
        "clause text",
        "clause node",
    )
)
f.write(
    tpl.format(
        "corrected",
        "enriched",
        "passage",
        "-",
        "object type",
        "clause rela",
        "clause type",
        "phrase function (old)",
        "phrase function (new)",
        "phrase type",
        "valence",
        "predication",
        "grammatical",
        "original",
        "lexical",
        "semantic",
        "object text",
        "object node",
    )
)
i = 0
h = 0
j = 0
c = 0
d = 0
CHUNK_SIZE = 10000
sel_verbs = set(verbs)
for cn in sorted(clause_verb):
    c += 1
    vrbs = sorted(clause_verb[cn])
    lex_vrbs = {F.lex.v(verb).rstrip("[=") for verb in vrbs}
    selected = len(lex_vrbs & sel_verbs) != 0
    if selected:
        d += 1
        sel_vrbs = [v for v in vrbs if F.lex.v(v).rstrip("[=") in verbs]

        g.write(
            tpl.format(
                "",
                "",
                "{} {}:{}".format(*T.sectionFromNode(cn)),
                " ".join(F.lex.v(verb) for verb in sel_vrbs),
                "",
                "",
                "",
                "",
                "",
                "",
                "",
                "",
                "",
                "",
                "",
                "",
                T.text(L.d(cn, "word"), fmt="text-trans-plain"),
                cn,
            )
        )

    f.write(
        tpl.format(
            "",
            "",
            "{} {}:{}".format(*T.sectionFromNode(cn)),
            " ".join(F.lex.v(verb) for verb in vrbs),
            "",
            "",
            "",
            "",
            "",
            "",
            "",
            "",
            "",
            "",
            "",
            "",
            T.text(L.d(cn, "word"), fmt="text-trans-plain"),
            cn,
        )
    )
    for pn in L.d(cn, "phrase"):
        i += 1
        if selected:
            h += 1
        j += 1
        if j == CHUNK_SIZE:
            j = 0
            utils.caption(
                0,
                "{:>6} selected of {:>6} constituents in {:>5} selected of {:>5} clauses ...".format(
                    h, i, d, c
                ),
            )

        material = tpl.format(
            "COR" if F.f_correction.v(pn) == "y" else "",
            "MAN" if F.s_manual.v(pn) == "y" else "",
            "{} {}:{}".format(*T.sectionFromNode(pn)),
            "",
            "phrase",
            "",
            "",
            F.function.v(pn),
            F.cfunction.v(pn),
            F.typ.v(pn),
            F.valence.v(pn),
            F.predication.v(pn),
            F.grammatical.v(pn),
            F.original.v(pn),
            F.lexical.v(pn),
            F.semantic.v(pn),
            T.text(L.d(pn, "word"), fmt="text-trans-plain"),
            pn,
        )
        f.write(material)
        if selected:
            g.write(material)
    for scn in clause_objects[cn]:
        i += 1
        if selected:
            h += 1
        j += 1
        if j == CHUNK_SIZE:
            j = 0
            utils.caption(0, "{:>6} constituents in {:>5} clauses ...".format(i, c))
        material = tpl.format(
            "",
            "",
            "{} {}:{}".format(*T.sectionFromNode(scn)),
            "",
            "clause",
            F.rela.v(scn),
            F.typ.v(scn),
            "",
            "",
            "",
            F.valence.v(scn),
            F.predication.v(scn),
            F.grammatical.v(scn),
            F.original.v(scn),
            F.lexical.v(scn),
            F.semantic.v(scn),
            T.text(L.d(scn, "word"), fmt="text-trans-plain"),
            scn,
        )
        f.write(material)
        if selected:
            g.write(material)

In [80]:
f.close()
g.close()
utils.caption(
    0,
    "{:>6} selected of {:>6} constituents in {:>5} selected of {:>5} clauses done".format(
        h, i, d, c
    ),
)

    29s collecting constituents ...
    30s   2439 selected of  10000 constituents in   698 selected of  3065 clauses ...
    30s   4980 selected of  20000 constituents in  1444 selected of  6135 clauses ...
    31s   8078 selected of  30000 constituents in  2299 selected of  9096 clauses ...
    32s  10451 selected of  40000 constituents in  2965 selected of 12043 clauses ...
    33s  13045 selected of  50000 constituents in  3714 selected of 15059 clauses ...
    34s  15631 selected of  60000 constituents in  4470 selected of 18127 clauses ...
    35s  18574 selected of  70000 constituents in  5306 selected of 21091 clauses ...
    35s  21265 selected of  80000 constituents in  6133 selected of 24167 clauses ...
    36s  24093 selected of  90000 constituents in  6962 selected of 27182 clauses ...
    37s  26843 selected of 100000 constituents in  7780 selected of 30269 clauses ...
    38s  29681 selected of 110000 constituents in  8674 selected of 33442 clauses ...
    39s  31693 sel

In [81]:
x = 671522
print(pf_corr.get(x, F.function.v(x)))
print(is_lex_local("FJM", x))
print(x in rule_cases["specific"]["phrase"][("FJM", 2)])
print(F.lexical.v(x))

Cmpl
True
False

