# [best viewed in NBviewer](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb)

# Phrase Heads

This notebook develops a new method of head detection, using insights gained from the first version of this data. The new effort improves on the previous one in two main ways:

* head selection is performed using Text Fabric templates, which offer a clearer, more transparent way to select and filter data
* it aims to track and address all edge cases

Most of the rationale and rules generated in [getting_heads.ipynb](getting_heads.ipynb) are carried over to this present notebook.

In [81]:
version = '2021' # configure version here

<hr>

In [82]:
from tf.app import use
from tf.fabric import Fabric
from IPython.display import display
import collections, random, csv, re, textwrap
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [83]:
A = use('ETCBC/bhsa', version=version, hoist=globals())
A.api.TF.load('g_cons_utf8 prs', add=True)

**Locating corpus resources ...**

| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| book | 39 | 10938.21 | 100 |
| chapter | 929 | 459.19 | 100 |
| lex | 9230 | 46.22 | 100 |
| verse | 23213 | 18.38 | 100 |
| half_verse | 45179 | 9.44 | 100 |
| sentence | 63717 | 6.7 | 100 |
| sentence_atom | 64514 | 6.61 | 100 |
| clause | 88131 | 4.84 | 100 |
| clause_atom | 90704 | 4.7 | 100 |
| phrase | 253203 | 1.68 | 100 |


True

In [84]:
# configure display
A.displaySetup(condenseType='phrase', withNodes=True, end=50, extraFeatures={'st'})

In [85]:
def show_subphrases(phrase, direction=L.d):
    '''
    A simple function to print subphrases
    and their relations to each other.
    '''
    for sp in direction(phrase, 'subphrase'):
        
        mother = E.mother.f(sp)[0] if E.mother.f(sp) else ''
        mother_text = T.text(mother)
        
        print('-'*7 + str(sp) + '-'*16)
        print()
        print(f'{T.text(sp)} -{F.rela.v(sp)}-> {mother_text}')
        print(f'nodes:  {sp} -{F.rela.v(sp)}-> {mother}')
        print(f'slots:  {L.d(sp, "word")} -{F.rela.v(sp)}-> {L.d(mother or 0, "word")}')
        print('-'*30)

# Defining Heads

A "semantic" head is the primary content word of a phrase, following Croft's "Primary Information Bearing Unit":

> **The noun and the verb are the PRIMARY INFORMATION-BEARING UNITS (PIBUs) of the phrase and clause respectively. In common parlance, they are the content words. PIBUs have major informational content that functional elements such as articles and [auxiliaries] do not have. (Croft, *Radical Construction Grammar*, 2001, 258; see also Shead, *Radical Frame Semantics and Biblical Hebrew*, 104)**

> **A (semantic) head is the profile equivalent that is the primary information-bearing unit, that is, the most contentful item that most closely profiles the same kind of thing that the whole constituent profiles. (ibid., 259)**

Croft also provides an additional criterion to "profile equivalence":

> **If the criterion of profile equivalence produces two candidates for headhood, the less schematic meaning is the PIBU; that is, the PIBU is the one with the narrower extension, in the formal semantic sense of that term (ibid., 259)**

The definition of a semantic head can be compared with the traditional definition of syntactic heads:
- the word with a part of speech after which a phrase type is named
- the word which semantically determines grammatical agreement

In this traditional, syntactic definition, the head of a prepositional phrase is a preposition, whereas a semantic head would be the lexical item subsumed under the preposition.

## Capturing Semantic Heads Algorithmically

The problem of selecting semantic heads cannot be solved comprehensively with rule-based methods, since headhood is specific to constructions and categories. However, syntactic relations can be used to capture a large majority of semantic heads. This is because syntactic constructions tend to put the primary profiled lexical item in a particular slot. In a noun phrase, for instance, the PIBU is typically found as a stand-alone term that does not function as a modifier (thus excluding adjectives, articles, quantifiers, etc.). In a prepositional phrase, the definition is the same, where the preposition is excluded since it is considered a modifier of the PIBU.
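
The idea of excluding modifiers can be illustrated with a deliberately simplified sketch. It is not the selection method used in this notebook: the tag names in `MODIFIER_TAGS`, the function `pick_head_candidates`, and the toy phrase (roughly "in all the land", transliterated) are all invented for illustration.

```python
# Toy illustration only: tags and words are made up for this sketch
# and are not read from BHSA.
MODIFIER_TAGS = {"prep", "art", "adjv", "quant"}

def pick_head_candidates(phrase):
    """Return the words of a phrase that are not tagged as modifiers."""
    return [word for word, tag in phrase if tag not in MODIFIER_TAGS]

# (word, tag) pairs in phrase order
toy_phrase = [("B", "prep"), ("KL/", "quant"), ("H", "art"), (">RY/", "subs")]

print(pick_head_candidates(toy_phrase))
```

In the real data the modifier status of a word depends on its subphrase relations and on the custom sets built below, which is why the templates are considerably longer than this filter.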

Thus **the goal of this NB, and project, is to provide semantic head data that is accurate for the majority of cases.** This means that even a 60% level of accuracy is deemed a success.

## Use Cases

Because semantic head data is not 100% accurate, it should not be used on its own. But it can be used for generating automatic labels that can be hand-checked and adjusted. It can also be useful for cursory explorations and data collection.


# Preprocessing Data

Before beginning the head selections, a number of important preprocessing tasks must be performed. This includes building up necessary custom sets that can be used to correctly select the heads, as well as accounting for shortcomings in the BHSA data. In this section the preprocessing is done.

## Prepare Custom Sets

Below I prepare a series of custom sets that can in turn be used in the head search templates. `A.search` takes an optional argument `sets` which is a dictionary of string keys and set values. The string keys can be written into search templates, so that a key of "wordKind", for instance, can be entered into the template as if it is an object. All of the instantiations of "wordKind" are identified by looking at the set, which contains object nodes (e.g. word nodes). 
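
As a conceptual model only (this is not how Text-Fabric implements it), a custom set name can be thought of as a type whose members are listed explicitly. The names `toy_otype`, `toy_sets`, and `candidates` below are hypothetical, and the node numbers are invented.

```python
# Toy stand-ins: node numbers and types are invented for this sketch.
toy_otype = {1: "word", 2: "word", 3: "word", 10: "phrase"}
toy_sets = {"wordKind": {1, 3}}

def candidates(name, sets, otype):
    """Nodes that a template line starting with `name` could match."""
    if name in sets:
        # custom set name: its members are the only candidates
        return set(sets[name])
    # ordinary object type: every node of that type is a candidate
    return {node for node, kind in otype.items() if kind == name}

print(candidates("wordKind", toy_sets, toy_otype))
print(candidates("word", toy_sets, toy_otype))
```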

In [86]:
sets = {}

### `iphrase_atom` (independent phrase atom)

Below is a set of independent phrase atoms. This set is needed since phrase atoms can exist within a chain of other coordinate phrase atoms, which itself may begin with a dependent element. By definition, a head is not a dependent element. So only phrase atoms in independent chains should be allowed. This requires a recursive check down the phrase atom chain to ensure all relations are independent.

In [87]:
def climb_pa_chain(relalist, phrase_atom):
    '''
    Recursive function that climbs 
    down phrase_atom parallel chains
    to identify all relations in the chain.
    '''
    mother = E.mother.f(phrase_atom)[0]
    relalist.append(F.rela.v(mother))
    if F.rela.v(mother) == 'Para':
        climb_pa_chain(relalist, mother)
        
# iterate through phrase atoms, apply climb_pa_chain, use resulting relations to select iphrase_atoms:
independent_phrasea = [pa for pa in F.otype.s('phrase_atom') if F.rela.v(pa) == 'NA']
for pa in F.rela.s('Para'):  
    chained_relas = []
    climb_pa_chain(chained_relas, pa)
    if not set(chained_relas) - {'NA', 'Para'}: # <- dependency check happens here: only allowable relas are NA and Para
        independent_phrasea.append(pa)
        
iphrase_atom = set(independent_phrasea)

sets['iphrase_atom'] = iphrase_atom

print(f'{len(iphrase_atom)} independent phrase atoms ready...')

254621 independent phrase atoms ready...
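
The dependency check can also be followed on toy data without loading the corpus. The sketch below is an iterative restatement of the same idea under invented data: `rela_of` and `mother_of` are hypothetical dictionaries standing in for `F.rela.v` and `E.mother.f`, and the node numbers are made up.

```python
# Toy data: node numbers and links are invented for this sketch.
rela_of = {1: "NA", 2: "Para", 3: "Para", 10: "NA", 11: "Spec", 12: "Para"}
mother_of = {2: 1, 3: 2, 11: 10, 12: 11}

def chain_relations(node, mother_of, rela_of):
    """Relations of the mothers reached by following 'Para' links upward."""
    relations = []
    while node in mother_of:
        node = mother_of[node]
        relations.append(rela_of[node])
        if rela_of[node] != "Para":
            break  # the chain ends at the first non-parallel mother
    return relations

def is_independent(node, mother_of, rela_of):
    """True for 'NA' atoms and for 'Para' atoms whose chain is all NA/Para."""
    if rela_of[node] == "NA":
        return True
    if rela_of[node] != "Para":
        return False
    return not set(chain_relations(node, mother_of, rela_of)) - {"NA", "Para"}

independent = {n for n in rela_of if is_independent(n, mother_of, rela_of)}
print(sorted(independent))
```

Atom 12 is parallel to atom 11, but atom 11 is itself a specification, so both drop out; atoms 2 and 3 stay in because their chain ends at an independent atom.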


### Substantival Adjectives

Several modifiers in the Hebrew Bible occur substantivally without any dependent relations. The most prominent of these modifiers are quantifiers, including terms such as כל and מעט (see also Waltke-O'Connor's helpful discussion at §14.3.1c). However, other kinds of non-quantifying adjectives are included here. One example is the substantive עצם, which often functions attributively to words like יום (as in Gen 7:13 et al., בעצם היום הזה).

Below these words are identified with a series of sets and queries. In some cases, an entire lexeme can be considered a substantival modifier. In other cases, a query is made which stipulates conditions to trigger the adjectival sense.

#### `quant` (quantifiers)

Different lexemes are used to quantify nouns in the Hebrew Bible. Cardinal numbers are indicated in the BHSA with the feature `ls` (lexical set) and a value of `card`. However, other, more qualitative quantifiers are not formally marked, including lemmas such as כל or חצי. Also not included is the use of בן + cardinal number, where בן functions idiomatically as a part of the quantifying phrase rather than a true head. These cases are defined below and gathered into a `quant` set. 

In [88]:
custom_quants = {'KL/', 'M<V/', 'JTR/',
                 'M<FR/', 'XYJ/', '<FRWN/',
                 'C>R=/', 'MSPR/', 'XYWT/', 
                 'RB/', 'RB=/', 'MXYJT/'}
quantlexs = '|'.join(custom_quants) # pipe separated string for optional use in search templates

# put quantifier word nodes in here
quantifiers = [w for w in F.otype.s('word')
                   if F.lex.v(w) in custom_quants
                   or F.ls.v(w) == 'card']

# for the Hebrew idiom: בן + quantifier for age
quantifiers.extend(A.search('''

quant:word lex=BN/ st=c nu=sg
/with/
phrase_atom
    quant
    <: word ls=card
/-/

''', shallow=True, silent=True))

quantifiers = set(quantifiers)
sets['quant'] = quantifiers

print(f'{len(quantifiers)} custom quantifiers ready...')

13073 custom quantifiers ready...


### modis (modifiers)

This set consists of substantival terms in the Hebrew Bible such as עצם "same". Other cases will be added in the future as needed.

In [89]:
etsem = set(res[1] for res in A.search('''

phrase_atom
    word lex=<YM/ nu=sg
    <mother- subphrase rela=rec
        word pdp=subs lex#>DM/

'''))

modifiers = set()

modifiers |= etsem

  0.75s 20 results


### Merging Modifiers into Quantifiers

The original logic of this production line included only quantifiers. But as I have used heads, I have seen the need for an expanded set that also includes substantival adjectives. Rather than modify all of the quantifier names, for the time being I merge the modifiers into the quantifier set itself. So in some places the documentation or code may refer to quantifiers where the implementation actually also includes modifiers.

In [90]:
sets['quant'] |= modifiers

### Further Research Needed

Note that numerous other cases need to be investigated. These can be isolated using the final heads feature with the following query:

In [91]:
adjectival_np_heads = '''

phrase typ=NP
    <head- word pdp=adjv vt#ptca|ptcp
    
'''

The query finds about 100 cases, several of which should somehow be incorporated into the sets above. But there are a number of complexities that prevent a simple assignment. Many of the adjectival relations are verbal in nature, with the substantive in construct serving as an object (e.g. יראי אלהים in Ps 66:16). **Thus, this section requires further research and development.**

### `prep` (prepositions)

Many prepositions are marked in BHSA with the feature `pdp` (phrase dependent part of speech) with a value of `prep`. However, this is not true of all prepositions. Other prepositions are marked with the `ls` (lexical set) feature with a value of `ppre` (potential preposition). Still other semi-prepositional lemmas are missed: פנה when preceded by ל (as in לפני); words indicating position, such as תוך (middle) and קץ (end); and words indicating continuity, such as עוד (still).

Preposition definitions need to be further investigated and defended. Some terms that are pseudo-prepositional, such as פתח "entrance" (e.g. in פתח אהל "entrance of a tent"), are not included. More investigation is needed to determine sound criteria for prepositions as a discrete class.

In [92]:
# prepare prepositions set

preps = [w for w in F.otype.s('word') if F.pdp.v(w) == 'prep']

# add special בד "alone" when it is 
# preceded by ל, with a meaning of "except"
preps.extend(A.search('''

prep:word lex=BD/
/with/
phrase_atom
    word pdp=prep lex=L
    <: prep
/-/

''', shallow=True, silent=True))

# The prepositions below are lemma sets like פנה or תוך
# These sets could benefit from further investigation
preps.extend(A.search('''

prep:word prs=absent lex=PNH/|TWK/|QY/|QYH=/|QYT/|<WD/
/with/
% ensure potential prep is preceded by:
% prep, potential prep (ls), or כל
% else there should be no interruption
phrase_atom
    word
    /with/
    pdp=prep
    /or/
    ls=ppre
    /or/
    lex=KL/
    /-/
    <: prep
/-/
/with/
% ensure prep is followed by at least one non ו word
phrase_atom
    prep
    <: word lex#W
/-/

''', shallow=True, silent=True))

# several cases of אחרית are substantive in nature, e.g. אחרית רשׁעים "end of evil doers" (Ps 37:38)
# others are used prepositionally to indicate position
# the semantics of the phrase is important for determining which sense is employed
# all cases in Time Phrases appear prepositional
# if used with an animate noun, it appears that אחרית is used substantivally
# those cases can be manually excluded with a lexeme exclusion
# NB: גוים in Jer 50:12 is used non-personally and thus not excluded
# Excluded: איוב and רשעים
preps.extend(A.search('''

prep:word lex=>XRJT/
/with/
phrase_atom
    prep
    <mother- subphrase rela=rec
        word pdp=subs lex#>JWB/|RC</
/-/

''', shallow=True, silent=True))

In [93]:
# Below potential preps are added, but דרך is excluded
# since this is a more speculative preposition
preps.extend(A.search('''

word ls=ppre st=c lex#DRK/

''', shallow=True, silent=True))

print(f'{len(preps)} custom prepositions ready...')

78154 custom prepositions ready...


#### Finalize Preps Set

In [94]:
preps = set(preps)
sets['prep'] = preps

In [95]:
len(preps)

76914

### `nonprep`, `nonquant`, and `nonquantprep`

The custom prepositions defined above are often marked in BHSA with a `pdp` (phrase dependent part of speech) of `subs` (substantive) rather than `prep`. This results in unwanted selections. Furthermore, there are many cases where the selection of a quantifier is to be explicitly disallowed. The sets `nonprep`, `nonquant`, and `nonquantprep` enable exclusions or selections to be made without lengthening the search templates.

In [96]:
quantpreps = quantifiers|preps

# non quantifiers
non_quant = set(w for w in F.otype.s('word') if w not in quantifiers)

# non prepositions
non_prep = set(w for w in F.otype.s('word') if w not in preps)

# non quantifiers or prepositions
nonquantprep = set(w for w in F.otype.s('word') if w not in quantpreps)

sets['nonprep'] = non_prep
sets['nonquant'] = non_quant
sets['nonquantprep'] = nonquantprep

print(f'{len(non_quant)} non-quantifying words ready')
print(f'{len(non_prep)} non-prepositional words ready')
print(f'{len(nonquantprep)} non-prepositional, non-quantifier words ready')

413497 non-quantifying words ready
349676 non-prepositional words ready
336583 non-prepositional, non-quantifier words ready
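
The printed counts can be cross-checked with simple set arithmetic. The sketch below uses only numbers from the outputs above and assumes, as the code does, that all three complements are taken over the same list of word nodes.

```python
# Counts copied from the cell outputs above
n_prep = 76914            # len(preps) after deduplication
n_non_quant = 413497
n_non_prep = 349676
n_non_quantprep = 336583

n_words = n_prep + n_non_prep              # every word is either prep or non-prep
n_quant = n_words - n_non_quant            # size of the quantifier set
n_union = n_words - n_non_quantprep        # size of quantifiers | preps
n_overlap = n_quant + n_prep - n_union     # inclusion-exclusion

print(n_words, n_quant, n_union, n_overlap)
```

The overlap comes out as zero, so no word is counted as both a quantifier and a preposition. The quantifier count is 20 higher than the 13073 printed earlier, which is consistent with the 20 `modifiers` results having been merged into the set.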


### `postprep` and `nonpostprep`

It is often necessary to ensure a word is or isn't preceded immediately by a preposition. Due to the potential presence of an intermediating article, these cases can become lengthy within the templates. These two sets provide a way to reference these words simply.

In [97]:
precede_prep = A.search('''

word
/with/
phrase_atom
    prep
    <: ..
/or/
phrase_atom
    prep
    <: word pdp=art
    <: ..
/-/
''', shallow=True, sets=sets, silent=True)

postprep = set(precede_prep)
nonpostprep = set(w for w in F.otype.s('word') if w not in postprep)

sets['postprep'] = postprep
sets['nonpostprep'] = nonpostprep

print(f'{len(postprep)} post prepositional words ready...')
print(f'{len(nonpostprep)} non-post prepositional words ready...')

76191 post prepositional words ready...
350399 non-post prepositional words ready...
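
What the two alternatives in the template select can be seen on a single toy phrase atom. The following is a plain-Python sketch, not a replacement for the search; the function `post_prepositional`, the word list, and the index set `toy_preps` are invented for illustration.

```python
def post_prepositional(words, preps):
    """Indexes of words preceded by a preposition, directly or via one article.

    words: list of (lexeme, pdp) pairs for one phrase atom, in order
    preps: set of indexes in `words` that count as prepositions
    """
    found = set()
    for i in range(len(words)):
        if i >= 1 and (i - 1) in preps:
            found.add(i)                     # prep <: word
        elif i >= 2 and words[i - 1][1] == "art" and (i - 2) in preps:
            found.add(i)                     # prep <: article <: word
    return found

# toy atom, roughly "in the great house" (transliterated)
toy_atom = [("B", "prep"), ("H", "art"), ("BJT/", "subs"), ("GDWL/", "adjv")]
toy_preps = {0}

print(sorted(post_prepositional(toy_atom, toy_preps)))
```

Note that the article itself satisfies the first alternative, so it lands in the set together with the noun that follows it.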


## Missing Relation Problems in BHSA

The BHSA subphrases have several missing relations that prevent correct head selection. These missing relations illustrate the urgent need for a new data model that can address the shortcomings. Some of these issues may be due to limitations in the ETCBC data creation pipeline. For instance, in that pipeline, a word can only exist in a maximum of 3 subphrase relations. This limitation is caused by the outdated file format of [ps3.p](http://www.etcbc.nl/datacreation/#ps3.p). A BHSA2 that is native to Text-Fabric could address this shortcoming easily.

As a temporary solution, a set is created (`dword`, dependent word), which contains all words that *should* be in a dependent subphrase relation but are not. This set is in turn used to make a set of `iword` (independent_word). For the heads selections, all templates will search for `iword` objects rather than simple `word`. 

In [98]:
dwords = set()

### Missing `atr` and `rec`

There are dozens of cases in BHSA (the two searches below return 46 and 33 results, 79 in total) which lack a proper `atr` (attribution) or `rec` (nomen regens/rectum) subphrase relation. Manual inspection of the subphrase structures shows that this is frequently because the word exists in more than 3 subphrases (the max for ps3.p). Selecting these cases consists of the following parameters:

* find all cases of `subs + subs` or `subs + adjv` in adjacent relation; intervention of a definite article is allowed
* ensure that there is no subphrase that relates the second nominal element to the first
* ensure that the first nominal is in the construct relation
* OR ensure that (1) second nominal is an adjective, or (2) first nominal ends with maqqeph (e.g. כל־), or (3) nominal 1 and 2 occur together and alone in a subphrase.

Coding all of these requirements results in a rather lengthy search pattern, but it is effective in isolating the relevant cases. The cases are searched for and displayed below. The head noun is highlighted in green, while the word in relation to it is in pink. Again, the pink words are words which SHOULD have a relation to the green word, but do not have one in BHSA.

In [99]:
missing_atr_rec = A.search('''

phrase
    phrase_atom
        subs:nonquantprep pdp=subs|nmpr    
% stipulate that this word has some relation to the following word
% there are various checks to weed out spurious results:
        /with/
        st=c
        /or/
        <: word pdp=adjv
        /or/
        trailer=&
        /or/
        s1:subphrase
            =: subs
            <: w1:word pdp=adjv|subs
        w1 := s1
        /-/
        
        <: ad:word pdp=adjv|subs
% stipulate that this^ word has no relation to the first
        /without/
        s1:subphrase
            w1:word
        s2:subphrase rela=atr|adj|par|mod
            w2:word
        s1 <mother- s2
        w1 <: w2
        ad = w2
        /-/
        /without/
        w1:word
        s1:subphrase rela=rec
            w2:word
        w1 <mother- s1
        w1 <: w2
        ad = w2
        /-/

''', sets={'nonquantprep':nonquantprep}) + A.search('''

phrase
    phrase_atom
        nonquantprep pdp=subs|nmpr
        <: word pdp=art
        <: ad:word pdp=adjv|subs
% stipulate that this^ word has no relation to the first
        /without/
        s1:subphrase
            w1:word pdp=subs|nmpr
        s2:subphrase rela=atr|adj|par|mod
            w2:word pdp=art
            <: w3:word
            
        s1 <mother- s2
        w1 <: w2
        ad = w3
        /-/
        
        /without/
        w1:word pdp=subs|nmpr
        s1:subphrase rela=rec
            w2:word pdp=art
            w3:word
            
        w1 <mother- s1
        w1 <: w2
        ad = w3
        /-/
''', sets={'nonquantprep':nonquantprep})

  0.95s 46 results
  1.09s 33 results


In [100]:
random.shuffle(missing_atr_rec)

In [101]:
cutoff = 5

for i, res in enumerate(missing_atr_rec[:cutoff]):
    subs = res[2]
    atr = res[3] if len(res) == 4 else res[4]
    highlights = {subs:'lightgreen', atr:'pink'}
    A.prettyTuple(res, end=100, seq=i+1, highlights=highlights)

print(f'\t\t\t\t...RESULTS CUT OFF AT {cutoff}...')

				...RESULTS CUT OFF AT 5...


For all of these cases, we add the second substantive (pink) into the `dwords` set:

In [102]:
dwordsadded = 0
for res in missing_atr_rec:
    dword = res[3] if len(res) == 4 else res[4]
    dwords.add(dword)
    dwordsadded += 1
print(f'{dwordsadded} words added to dwords...')

79 words added to dwords...
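
Two details of the loop above are easy to miss: the dependent word is always the last member of a result tuple, whichever of the two searches produced it, and the counter tallies results rather than distinct words. A small sketch with invented tuples (the node numbers are hypothetical, and the third tuple repeats a final word only to show the difference):

```python
toy_results = [
    (100, 200, 1, 2),       # phrase, phrase_atom, noun, dependent word
    (101, 201, 5, 6, 7),    # phrase, phrase_atom, noun, article, dependent word
    (102, 202, 1, 2),       # ends in the same word as the first tuple
]

def dependent_word(res):
    """Same index logic as in the loop above."""
    return res[3] if len(res) == 4 else res[4]

toy_dwords = set()
added = 0
for res in toy_results:
    assert dependent_word(res) == res[-1]   # always the final member
    toy_dwords.add(res[-1])
    added += 1

print(added, len(toy_dwords))
```

If the two numbers differ on the real data, `len(dwords)` is the figure that reflects how many words are actually treated as dependent.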


#### Coordinations with Modifying Term

There remain multiple cases where the modifying words selected above have a coordinate word. We isolate those cases below and add them to `dwords`. 

In [103]:
par_dwords = A.search('''

phrase
    s1:subphrase
    /without/
        quant
    /-/
        =: dword
    s2:subphrase rela=par
        w:word pdp=subs|nmpr|adjv
        /without/
        subphrase rela=NA
            w
        /-/
s1 <mother- s2

''', sets={'dword':dwords, 'quant': quantifiers}) + A.search('''

phrase
    s1:subphrase
    /without/
        quant
    /-/
        := dword
    s2:subphrase rela=par
        w:word pdp=subs|nmpr|adjv
        /without/
        subphrase rela=NA
            w
        /-/
s1 <mother- s2

''', sets={'dword':dwords, 'quant': quantifiers})


new_par_dwords = set()

for res in par_dwords:    
    newdword = res[4]
    new_par_dwords.add(newdword)
    dwords.add(newdword)
    
print(f'{len(new_par_dwords)} new dwords added to dword set...')

  0.70s 5 results
  0.70s 13 results
13 new dwords added to dword set...


In [104]:
phrases_to_patch = []

# book, chapter, verse, clause_atom number, phrase number
missing_quant_relas = [('Daniel', 3, 23, 444, 2),
                       ('Daniel', 9, 25, 1423, 2),
                       ('Ezra', 1, 9, 39, 1),
                       ('Ezra', 1, 10, 42, 1),
                       ('Ezra', 8, 20, 621, 3),
                       ('1_Chronicles', 12, 29, 1071, 3)]

for book, chapter, verse, clat_nu, phrase_nu  in missing_quant_relas:
    findit = f'''    
    book book@en={book}
        chapter chapter={chapter}
            verse verse={verse}
                clause_atom number={clat_nu}
                    phrase number={phrase_nu}
    '''
    phrase = A.search(textwrap.dedent(findit))[0][4]
    
    A.prettyTuple((phrase,), seq=0)
    print('subphrase relations:')
    show_subphrases(phrase)
        
    phrases_to_patch.append(phrase)

  0.20s 1 result


subphrase relations:
-------1395339----------------

גֻבְרַיָּ֤א  -NA-> 
nodes:  1395339 -NA-> 
slots:  (372120,) -NA-> ()
------------------------------
-------1395340----------------

אִלֵּךְ֙  -dem-> גֻבְרַיָּ֤א 
nodes:  1395340 -dem-> 1395339
slots:  (372121,) -dem-> (372120,)
------------------------------
-------1395341----------------

שַׁדְרַ֥ךְ  -NA-> 
nodes:  1395341 -NA-> 
slots:  (372123,) -NA-> ()
------------------------------
-------1395342----------------

מֵישַׁ֖ךְ  -par-> שַׁדְרַ֥ךְ 
nodes:  1395342 -par-> 1395341
slots:  (372124,) -par-> (372123,)
------------------------------
-------1395343----------------

מֵישַׁ֖ךְ  -NA-> 
nodes:  1395343 -NA-> 
slots:  (372124,) -NA-> ()
------------------------------
-------1395344----------------

עֲבֵ֣ד נְגֹ֑ו  -par-> מֵישַׁ֖ךְ 
nodes:  1395344 -par-> 1395343
slots:  (372126,) -par-> (372124,)
------------------------------
  0.20s 1 result


subphrase relations:
-------1396409----------------

שָׁבֻעִ֞ים שִׁשִּׁ֣ים  -NA-> 
nodes:  1396409 -NA-> 
slots:  (376363, 376364) -NA-> ()
------------------------------
-------1396407----------------

שָׁבֻעִ֞ים  -NA-> 
nodes:  1396407 -NA-> 
slots:  (376363,) -NA-> ()
------------------------------
-------1396408----------------

שִׁשִּׁ֣ים  -par-> שָׁבֻעִ֞ים 
nodes:  1396408 -par-> 1396407
slots:  (376364,) -par-> (376363,)
------------------------------
-------1396410----------------

שְׁנַ֗יִם  -par-> שָׁבֻעִ֞ים שִׁשִּׁ֣ים 
nodes:  1396410 -par-> 1396409
slots:  (376366,) -par-> (376363, 376364)
------------------------------
  0.21s 1 result


subphrase relations:
-------1396959----------------

מַחֲלָפִ֖ים תִּשְׁעָ֥ה  -NA-> 
nodes:  1396959 -NA-> 
slots:  (378383, 378384) -NA-> ()
------------------------------
-------1396957----------------

מַחֲלָפִ֖ים  -NA-> 
nodes:  1396957 -NA-> 
slots:  (378383,) -NA-> ()
------------------------------
-------1396958----------------

תִּשְׁעָ֥ה  -par-> מַחֲלָפִ֖ים 
nodes:  1396958 -par-> 1396957
slots:  (378384,) -par-> (378383,)
------------------------------
-------1396960----------------

עֶשְׂרִֽים׃ ס  -par-> מַחֲלָפִ֖ים תִּשְׁעָ֥ה 
nodes:  1396960 -par-> 1396959
slots:  (378386,) -par-> (378383, 378384)
------------------------------
  0.21s 1 result


subphrase relations:
-------1396975----------------

כֵּלִ֥ים  -NA-> 
nodes:  1396975 -NA-> 
slots:  (378397,) -NA-> ()
------------------------------
-------1396976----------------

אֲחֵרִ֖ים  -atr-> כֵּלִ֥ים 
nodes:  1396976 -atr-> 1396975
slots:  (378398,) -atr-> (378397,)
------------------------------
  0.21s 1 result


subphrase relations:
-------1398659----------------

נְתִינִ֖ים  -NA-> 
nodes:  1398659 -NA-> 
slots:  (381874,) -NA-> ()
------------------------------
-------1398660----------------

מָאתַ֣יִם  -par-> נְתִינִ֖ים 
nodes:  1398660 -par-> 1398659
slots:  (381875,) -par-> (381874,)
------------------------------
-------1398661----------------

מָאתַ֣יִם  -NA-> 
nodes:  1398661 -NA-> 
slots:  (381875,) -NA-> ()
------------------------------
-------1398662----------------

עֶשְׂרִ֑ים  -par-> מָאתַ֣יִם 
nodes:  1398662 -par-> 1398661
slots:  (381877,) -par-> (381875,)
------------------------------
  0.21s 1 result


subphrase relations:
-------1405387----------------

גִּבֹּ֣ור  -NA-> 
nodes:  1405387 -NA-> 
slots:  (398213,) -NA-> ()
------------------------------
-------1405388----------------

חָ֑יִל  -rec-> גִּבֹּ֣ור 
nodes:  1405388 -rec-> 398213
slots:  (398214,) -rec-> ()
------------------------------


Perhaps it is significant that all of these examples come from the same cluster of books: Daniel, Ezra, and 1 Chronicles. This may indicate that the individual who encoded these texts did not understand the standard for relations of quantification in the ETCBC database. These are all cases that a `BHSA2` should address. For now, it is fair to correct them manually by removing all quantifiers selected as heads from these phrases.

The template below is tuned to pick out these examples: primarily they are cases where a phrase atom contains a quantified substantive, and this substantive has no dependent subphrase relations. Two additional checks are made with `/or/` to cover the peculiar cases of Ezra 1:10 (כלים אחרים אלף, i.e. where an adjective intervenes between the substantive and the quantifier) and Daniel 3:23 (גבריא אלך תלתהן, i.e. where a demonstrative intervenes).

In [105]:
non_rela_cardinals = A.search('''
phrase
    phrase_atom
        
        w1:word ls=card
        
        /without/
        subphrase rela=atr|adj|rec
            w1
        /-/
        /with/
        phrase_atom
            nonquantprep pdp=subs|nmpr
            <: word ls=card prs=absent
            < w1 prs=absent
        /or/
        phrase_atom
            nonquantprep pdp=subs|nmpr
            <: word pdp=prde
            <: w1
        /or/
        phrase_atom
            nonquantprep pdp=subs|nmpr
            <: word pdp=adjv
            < w1 prs=absent
        /-/
        
''', sets=sets)

  1.02s 25 results


In [106]:
A.show([res for res in non_rela_cardinals if res[0] in phrases_to_patch])

Below we check to see how many of our cases are covered by these criteria.

In [107]:
accounted = set(phrases_to_patch) & set(res[0] for res in non_rela_cardinals)
len(accounted)

5
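
The same set operations can also list which phrases, if any, the criteria fail to reach. The sketch below uses invented node numbers in place of the real phrase nodes; `phrases_to_patch_toy` and `matched_phrases_toy` are hypothetical stand-ins for `phrases_to_patch` and the phrases in `non_rela_cardinals`.

```python
# Invented node numbers standing in for phrase nodes
phrases_to_patch_toy = [11, 12, 13, 14, 15, 16]
matched_phrases_toy = {11, 12, 13, 14, 15, 40, 41}

accounted_toy = set(phrases_to_patch_toy) & matched_phrases_toy
unaccounted_toy = set(phrases_to_patch_toy) - matched_phrases_toy
extra_toy = matched_phrases_toy - set(phrases_to_patch_toy)

print(len(accounted_toy), sorted(unaccounted_toy), sorted(extra_toy))
```

On the real data, the set difference would name the phrase nodes that still need a manual look, and the reverse difference would list the extra matches to inspect.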

These criteria account for 5 of the 6 phrases gathered above, as well as a few extra cases that I have manually inspected to ensure that none are good heads.

In [108]:
dwordsadded = 0
for res in non_rela_cardinals:
    dword = res[2]
    dwords.add(dword)
    dwordsadded += 1
print(f'{dwordsadded} words added to dwords...')

25 words added to dwords...


## Incorrect Relation Assignment

There are a handful of cases where the ETCBC data has a relation that points at the wrong object. The few cases below are those which could not be fixed programmatically due to the complexity of the problem.

### Incorrect `par` Relations

In [109]:
bad_pars = []

bad_par1 = '''

book book@en=Jeremiah
    chapter chapter=32
        verse verse=32
            phrase
                word lex=BN/
                <: word lex=JHWDH/
'''
badpar1_note = 'בני־יהודה should be parallel to בני־ישראל rather than רעת בני־ישראל'
bad_pars.append({'template':bad_par1, 'phrasei':3, 'badi':4, 'note':badpar1_note})

bad_par2 = '''

book book@en=Jeremiah
    chapter chapter=40
        verse verse=1
            phrase
                word lex=JHWDH/
'''
badpar2_note = 'יהודה should be parallel to ירושלים rather than גלות־ירושלים'
bad_pars.append({'template':bad_par2, 'phrasei':3, 'badi':4, 'note':badpar2_note})

In [110]:
bad_par_dwords = set()

for i, bp in enumerate(bad_pars):
    bp_res = A.search(bp['template'], silent=False)
    phrase = bp_res[0][bp['phrasei']]
    bad = bp_res[0][bp['badi']]
    bad_par_dwords.add(bad)
    
    print()
    A.prettyTuple((phrase, bad), seq=bp['note'])
    print(f'subphrases containing slot {bad}')
    show_subphrases(bad, direction=L.u)

  0.54s 1 result



subphrases containing slot 252228
-------1366903----------------

בְנֵ֣י  -NA-> 
nodes:  1366903 -NA-> 
slots:  (252228,) -NA-> ()
------------------------------
-------1366905----------------

בְנֵ֣י יְהוּדָ֗ה  -par-> רָעַ֨ת בְּנֵֽי־יִשְׂרָאֵ֜ל 
nodes:  1366905 -par-> 1366902
slots:  (252228, 252229) -par-> (252224, 252225, 252226)
------------------------------
-------1366906----------------

רָעַ֨ת בְּנֵֽי־יִשְׂרָאֵ֜ל וּבְנֵ֣י יְהוּדָ֗ה  -rec-> כָּל־
nodes:  1366906 -rec-> 252223
slots:  (252224, 252225, 252226, 252227, 252228, 252229) -rec-> ()
------------------------------
  0.41s 1 result



subphrases containing slot 256816
-------1368210----------------

יהוּדָ֔ה  -par-> גָּל֤וּת יְרוּשָׁלִַ֨ם֙ 
nodes:  1368210 -par-> 1368209
slots:  (256816,) -par-> (256813, 256814)
------------------------------
-------1368211----------------

גָּל֤וּת יְרוּשָׁלִַ֨ם֙ וִֽיהוּדָ֔ה  -rec-> כָּל־
nodes:  1368211 -rec-> 256812
slots:  (256813, 256814, 256815, 256816) -rec-> ()
------------------------------
-------1368212----------------

כָּל־גָּל֤וּת יְרוּשָׁלִַ֨ם֙ וִֽיהוּדָ֔ה  -rec-> תֹ֨וךְ 
nodes:  1368212 -rec-> 256811
slots:  (256812, 256813, 256814, 256815, 256816) -rec-> ()
------------------------------


As can be seen in the subphrase printouts, these parallel relations do not point to the best subphrase. These problems are fixed below by adding the paralleled terms to `dwords`.

In [111]:
dwords |= bad_par_dwords
list(bad_par_dwords)[0] in dwords # sanity check

True

## Missed Parallels (due to ~bad Spec relations)

Some phrases are inconsistently marked as `Spec` (specification) instead of `Para` in the phrase atom assignment, whereas they are assigned as "par" elsewhere in the subphrase. Below a search is made for such cases.

These cases can be isolated by finding a phrase_atom and its daughter, wherein the daughter has a relation of `Spec` (specification), the daughter is a word-for-word mirror of the mother, and the two phrase atoms are adjacent to each other.

It is perhaps debatable what it actually means for these items to be designated "parallel" instead of "specification." But it does not seem, on the surface, that the repeated element really specifies the first as much as it works in conjunction with it to produce plurality. For this reason, these cases are treated as instances of `Para` relations rather than their designation as `Spec` in BHSA.

In [112]:
def tokenPhrase(phrasenode):
    '''
    Tokenizer to compare phrase atoms w/out vocalization.
    '''
    return '.'.join([(F.g_cons_utf8.v(w) if F.lex.v(w) != 'H' else 'ה') for w in L.d(phrasenode, 'word')])

In [113]:
tagged = []

for phrasea in F.otype.s('phrase_atom'):
    
    token = tokenPhrase(phrasea)
    daughters = [d for d in E.mother.t(phrasea) # get daughters
                     if F.rela.v(d) == 'Spec' # must have spec rela
                     and tokenPhrase(d) == token # must be identical to mother
                     and d-phrasea == 1] # must be adjacent to mother
    
    if daughters:
        tagged.append((phrasea, daughters))
        
len(tagged)

19
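
The three conditions (a `Spec` relation, an identical token string, and adjacent node numbers) can be tried out on toy data. In the sketch below, `token_of`, `rela_of`, and `daughters_of` are hypothetical dictionaries standing in for `tokenPhrase`, `F.rela.v`, and `E.mother.t`; the node numbers and tokens are invented.

```python
# Toy data: node numbers and tokens are invented for this sketch.
token_of = {1: "JWM", 2: "JWM", 3: "H.>JC", 4: "H.MLK", 5: "DWR", 7: "DWR"}
rela_of = {1: "NA", 2: "Spec", 3: "NA", 4: "Spec", 5: "NA", 7: "Spec"}
daughters_of = {1: [2], 3: [4], 5: [7]}

def mirrored_spec_daughters(atom, daughters_of, rela_of, token_of):
    """Daughters that are Spec, repeat the mother's tokens, and directly follow it."""
    return [d for d in daughters_of.get(atom, [])
            if rela_of[d] == "Spec"               # must have Spec rela
            and token_of[d] == token_of[atom]     # must be identical to mother
            and d - atom == 1]                    # must be adjacent to mother

tagged_toy = []
for atom in sorted(daughters_of):
    mirrors = mirrored_spec_daughters(atom, daughters_of, rela_of, token_of)
    if mirrors:
        tagged_toy.append((atom, mirrors))

print(tagged_toy)
```

Atom 4 is rejected because its tokens differ from its mother's, and atom 7 because it is not the node directly after atom 5.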

The cases in question are displayed and highlighted below. 

In [114]:
for data in tagged:
    show = (data[0],)+tuple(data[1])
    A.prettyTuple(show, seq=0)

Since the second phrase atom has a `Spec` designation, it is left out of the `iphrase_atom` set, and hence not selected by the parameters further below. We make an exception for these cases by adding the second phrase_atom to the `iphrase_atom` set below.

In [115]:
for para_data in tagged:
    para = para_data[1][0]
    sets['iphrase_atom'].add(para)

## Missed Parallels (due to ~bad adj relations in subphrases)

A select number of cases resemble the problem above, except expressed via the `adj` subphrase relation, which should instead reflect a `par` relation. These cases are difficult to isolate successfully, and there are 4 exclusions that need to be made due to the inadequate nature of the subphrase relations. These cases are manually excluded via reference. The remaining cases are placed in a set which allows them to be excluded from other subphrase exclusions.
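
The manual exclusion step amounts to filtering candidates by their verse reference. In the small sketch below, only the four excluded references are taken from the cell that follows; the candidate node numbers and the Genesis and Numbers references are made up for illustration.

```python
# Hypothetical candidates: (subphrase node, (book, chapter, verse)).
candidates = [
    (101, ('Genesis', 7, 2)),
    (102, ('Isaiah', 18, 2)),
    (103, ('Daniel', 9, 24)),
    (104, ('Numbers', 3, 9)),
]

# The four references excluded by hand in the search loop.
excluded_refs = {
    ('Isaiah', 18, 2),
    ('Isaiah', 18, 7),
    ('Daniel', 9, 24),
    ('2_Chronicles', 30, 10),
}

# Keep only candidates whose reference is not on the exclusion list.
kept = {node for node, ref in candidates if ref not in excluded_refs}
print(sorted(kept))
```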

In [116]:
tagged_sp = []

for subphrase in F.otype.s('subphrase'):
    
    # manual exclusion
    if T.sectionFromNode(subphrase) in [('Isaiah', 18, 2), 
                                        ('Isaiah', 18, 7),
                                        ('Daniel', 9, 24), 
                                        ('2_Chronicles', 30, 10)]:
        continue
    
    token = tokenPhrase(subphrase)
    daughters = [d for d in E.mother.t(subphrase) # get daughters
                     if F.rela.v(d) == 'adj' # must have adj rela
                     and tokenPhrase(d) == token # must be identical to mother
                     and d-subphrase == 1] # must be adjacent to mother
                    #and L.d(d, 'word')[0] - L.d(subphrase, 'word')[-1] == 1] # must be adjacent to mother

    
    if daughters:
        tagged_sp.append((subphrase, daughters))
        
len(tagged_sp)

5

In [117]:
for data in tagged_sp:
    highlights={}
    
    for w in L.d(data[0], 'word'):
        highlights[w]='pink'
    for w in L.d(data[1][0], 'word'):
        highlights[w]='lightblue'
    
    show = L.d(data[0], 'word') + L.d(data[1][0], 'word')
    A.prettyTuple(show, seq=0, highlights=highlights)

In [118]:
goodsubphrases = set()

for data in tagged_sp:
    good_sp = data[1][0]
    goodsubphrases.add(good_sp)

In [119]:
sets['goodsp'] = goodsubphrases

### NB: `goodsp` is only needed, and hence only used, in the prepositional phrase selection, since that is the only attested phrase type above. This prevents unnecessary editing for now. However, if `goodsp` is needed later on, it will need to be added to all of the search templates!

## Completing `iword`

Below `dword` is finalized and used to create the `iword` set.

In [120]:
iwords = set(w for w in F.otype.s('word') if w not in dwords)
sets['iword'] = iwords
sets['dword'] = dwords

# Selecting Heads

Now that the preprocessing procedures are complete, they can be applied to the search templates to find the phrase heads. I will follow a process of elimination for assigning heads to phrases: I first select all phrases, and then track which of them have been matched with a head.

In [121]:
remaining_phrases = set(result[0] for result in A.search('phrase')) # get all phrases
covered_phrases = set() # put covered phrases here
remaining_types = list(feat[0] for feat in F.typ.freqList(nodeTypes='phrase')) # track and eliminate phrase types

  0.18s 253203 results


**All phrase to head assignments will be made in the dictionary below:**

In [122]:
phrase2heads = collections.defaultdict(set)

In order to assist the process of elimination, the functions below programmatically record the heads in `phrase2heads` and remove them from the remaining set. `query_heads` iterates through a dictionary of queries and calls `record_head` on each result. `heads_status` provides a simple readout of what phrases remain to be analyzed.
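
The bookkeeping can be illustrated on toy data before the real functions are defined. This is a minimal sketch with made-up node numbers; the helper name `record` is hypothetical and leaves out the word-type check of the real function.

```python
import collections

remaining = {1, 2, 3}                        # toy phrases still lacking a head
covered = set()                              # toy phrases with at least one head
phrase2heads_demo = collections.defaultdict(set)

def record(phrase, head):
    """Map a phrase to a head and mark the phrase as covered."""
    remaining.discard(phrase)                # no error if already removed
    phrase2heads_demo[phrase].add(head)
    covered.add(phrase)

record(1, 100)
record(2, 200)
record(2, 201)                               # second head of a coordinated phrase

print(f'{len(covered)} matched, {len(remaining)} remaining')
```

One Python detail is worth keeping in mind for the real functions: default arguments such as `remaining=remaining_phrases` are bound when the function is defined. If the global sets are later rebuilt by rerunning their cell, the function definition needs to be rerun as well, or it will keep updating the old objects.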

In [123]:
def record_head(phrase, head, mapping=phrase2heads, remaining=remaining_phrases, covered=covered_phrases):
    '''
    Simple function to track phrases
    with heads that are accounted for
    and to modify the phrase2heads
    dict, which is a mapping from a phrase
    node to its head nodes.
    '''
    # discard() rather than remove(): a phrase with plural heads
    # has already been taken out of the remaining set by the time
    # its second head is recorded
    remaining.discard(phrase)
    
    if F.otype.v(phrase) == 'word':
        raise Exception(f'node {phrase} is a word not a phrase!')
    
    mapping[phrase].add(head) # record it
    covered.add(phrase)
    
def query_heads(querydict, phrasei=0, headi=1, sets={}):
    '''
    Runs queries on phrasetype/query dict.
    Reports results.
    Adds results.
    
    phrasei - the index of the phrase result in the search template.
    headi - the index of the head result in the search template.
    sets - custom sets for TF search
    '''
    for phrasetype, query in querydict.items():
        print(f'running query on {phrasetype}')
        results = A.search(query, silent=True, sets=sets)
        print(f'\t{len(results)} results found')
        for res in results:
            phrase, head = res[phrasei], res[headi]
            record_head(phrase, head)
            
def heads_status():
    # simply prints accounted vs unaccounted heads
    print(f'{len(covered_phrases)} phrases matched with a head...')
    print(f'{len(remaining_phrases)} phrases remaining...')

## Simple Heads

The selection of heads for certain phrase types is very straightforward. Those are defined in the templates below and are subsequently applied. These phrase types are selected based on the survey of their subphrase relations as found in the old notebook.

In [124]:
simp_heads = dict(

PPrP = '''
% personal pronoun

phrase typ=PPrP
    iphrase_atom
        iword pdp=prps

''',

DPrP = '''
% demonstrative pronoun

phrase typ=DPrP
    iphrase_atom
        iword pdp=prde

''',

InjP = '''
% interjectional

phrase typ=InjP
    iphrase_atom
        iword pdp=intj

''',

NegP = '''
% negative

phrase typ=NegP
    iphrase_atom
        iword pdp=nega

''',

InrP = '''
% interrogative

phrase typ=InrP
    iphrase_atom
        iword pdp=inrg

''',
    
IPrP = '''
% interrogative pronoun

phrase typ=IPrP
    iphrase_atom
        iword pdp=prin

''',

) # end of dictionary

### Make Queries, Record Heads, See What Remains

Here we run the queries and run `record_head` over each result. In all of the templates the head is the third item in the result tuple (index 2), after the phrase and the phrase atom.
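
Each search result is a tuple with one node per line of the template, in template order. The sketch below uses made-up node numbers to show how `phrasei` and `headi` pick out the phrase and head positions for a three-line template.

```python
# Hypothetical result tuples shaped like (phrase, iphrase_atom, iword).
results = [
    (900, 800, 5),
    (901, 801, 9),
    (901, 802, 12),   # same phrase, second phrase atom
]

phrasei, headi = 0, 2   # positions of the phrase and the head in each tuple

pairs = [(res[phrasei], res[headi]) for res in results]
print(pairs)
```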

In [125]:
query_heads(simp_heads, headi=2, sets=sets)
        
print('\n', '<>'*20, '\n')
heads_status()

running query on PPrP
	4387 results found
running query on DPrP
	790 results found
running query on InjP
	1883 results found
running query on NegP
	6742 results found
running query on InrP
	1291 results found
running query on IPrP
	798 results found

 <><><><><><><><><><><><><><><><><><><><> 

15866 phrases matched with a head...
237337 phrases remaining...


### Find Remaining Phrases

What phrases with the above types remain unaccounted for?

In [126]:
unaccounted_simp = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) in simp_heads)
len(unaccounted_simp)

0

In [127]:
for typ in simp_heads:
    remaining_types.remove(typ)
print(remaining_types)

['VP', 'PP', 'CP', 'NP', 'PrNP', 'AdvP', 'AdjP']


## Mostly Simple Heads

The next set of heads require a bit more care since they can contain a bigger variety of relationships.

### VP
The VP has only one complication: a single VP contains more than one verb.

In [128]:
mult_verbs = A.search('''

phrase typ=VP
/with/
    word pdp=verb
    < word pdp=verb
/-/
''')
A.show(mult_verbs, condenseType='clause')

  0.69s 1 result


The template below handles this case by selecting only the first verb in each VP as the head, without requiring that the VP begin with a verb.
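
The effect of the `/without/` clause can be pictured as choosing the first verb in slot order. The following is a toy sketch; the word list and the function name `first_verb` are hypothetical, and the `iword` restriction of the real template is left out.

```python
# Hypothetical words of one VP in slot order: (slot, pdp).
vp_words = [(1, 'prep'), (2, 'verb'), (3, 'verb')]

def first_verb(words):
    """Return the slot of the first verb, or None if there is no verb."""
    for slot, pdp in words:
        if pdp == 'verb':
            return slot
    return None

print(first_verb(vp_words))
```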

In [129]:
VP = '''

phrase typ=VP
    
    head:iword pdp=verb
    
    /without/
    phrase
        word pdp=verb
        < head
    /-/

'''

VP_search = A.search(VP, sets=sets)

for phrase, head in VP_search:
    record_head(phrase, head)
    
heads_status()

  0.83s 69024 results
84890 phrases matched with a head...
168313 phrases remaining...


See what's left...

In [130]:
unaccounted_vp = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'VP')
len(unaccounted_vp)

0

In [131]:
remaining_types.remove('VP')
print(remaining_types)

['PP', 'CP', 'NP', 'PrNP', 'AdvP', 'AdjP']


### CP

The conjunction phrase is relatively straightforward. But there are 1140 cases where the conjunction phrase is technically headed by a preposition in the ETCBC data. These are phrases such as בטרם and בעבור (see the more detailed analysis in the previous notebook). It is not at all clear why the ETCBC encodes these as conjunction phrases. This is almost certainly a confusion of the formal `typ` value with the functional `function` label (with a value of `Conj`). Nevertheless, here we choose to select the preposition as the true head.

These cases ought to be repaired in BHSA2.
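
The two selection rules can be sketched on toy word lists. This is a simplified illustration, not the search itself: `cp_heads` is a hypothetical helper, it works on a single phrase atom, and it ignores the `iword` and `iphrase_atom` restrictions.

```python
def cp_heads(words):
    """
    words: list of (slot, pdp) for one toy CP phrase atom, in slot order.
    Rule 1: with no preposition present, every conjunction that is not
            immediately preceded by another conjunction is a head.
    Rule 2: with a preposition present, an atom-initial preposition is the head.
    """
    if any(pdp == 'prep' for _, pdp in words):
        return [words[0][0]] if words[0][1] == 'prep' else []
    heads = []
    prev_pdp = None
    for slot, pdp in words:
        if pdp == 'conj' and prev_pdp != 'conj':
            heads.append(slot)
        prev_pdp = pdp
    return heads

print(cp_heads([(1, 'conj')]))
print(cp_heads([(1, 'prep'), (2, 'subs')]))
```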

In [132]:
cp_heads = dict(

conj = '''

phrase typ=CP
/without/
    word pdp=prep
/-/
    iphrase_atom
        head:iword pdp=conj
        /without/
        phrase_atom
            word pdp=conj
            <: head
        /-/

''',
    
prep_conj = '''

phrase typ=CP
    iphrase_atom
        =: word pdp=prep

'''

)



In [133]:
query_heads(cp_heads, headi=2, sets=sets)
        
print('\n', '<>'*20, '\n')
heads_status()

running query on conj
	51341 results found
running query on prep_conj
	1140 results found

 <><><><><><><><><><><><><><><><><><><><> 

137371 phrases matched with a head...
115832 phrases remaining...


#### CP Sanity Check

In [134]:
unaccounted_cp = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'CP')
len(unaccounted_cp)

0

In [135]:
remaining_types.remove('CP')
print(remaining_types)

['PP', 'NP', 'PrNP', 'AdvP', 'AdjP']


### AdjP

The adjective phrase always occurs with a word that has a `pdp` of adjective: 

In [136]:
A.search('''

phrase typ=AdjP
/without/
    word pdp=adjv
/-/

''')

  0.34s 0 results


[]

By playing with the `head:word pdp=` value below, I ascertain that there are 8 uses of `subs` as a head in this phrase type, and 1 use of `advb` as a head. These variants are due to the phrase containing multiple heads, with the first having a `pdp` of `adjv`, formally making the phrase an `AdjP`. 

The selection criteria are as follows. We want all cases in an adjective phrase where the word has a `pdp` of `adjv`, `subs`, or `advb`. The head candidate must not be found in a modifying subphrase, defined as `rela=adj|atr|rec|mod|dem` (remember that a word can often occur in multiple subphrases); and the only acceptable values for phrase_atom and subphrase relations are either `NA` (no relation), or `Para`/`par` (coordinate relation). In this latter case, it is expected here that the first requirement will prevent spurious parallel results (that is, words that are parallel not to a head but to a modifying element).

The requirements are set in the pattern below.
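
As a rough illustration, the same requirements can be written as a plain predicate over one word. This is a sketch under simplifying assumptions: the function name and its arguments are hypothetical, the relations of all subphrases containing the word are assumed to be collected beforehand, and membership in `iphrase_atom` and `iword` is not checked.

```python
MODIFIER_RELAS = {'adj', 'atr', 'rec', 'mod', 'dem'}
HEAD_PDPS = {'adjv', 'subs', 'advb'}

def is_adjp_head(pdp, subphrase_relas, preceded_by_construct):
    """
    pdp: phrase-dependent part of speech of the candidate word
    subphrase_relas: relations of every subphrase containing the word
    preceded_by_construct: True if the previous word has st=c
    """
    if pdp not in HEAD_PDPS:
        return False
    relas = set(subphrase_relas)
    if relas & MODIFIER_RELAS:               # used as a modifier somewhere
        return False
    if relas and not relas & {'NA', 'par'}:  # embedded, but never as NA/par
        return False
    return not preceded_by_construct

print(is_adjp_head('adjv', [], False))
print(is_adjp_head('adjv', ['NA', 'atr'], False))
```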

In [137]:
AdjP = '''

phrase typ=AdjP
    iphrase_atom
        head:iword pdp=adjv|subs|advb
        
% require either NA subphrase relation
% or no subphrase embedding:
        /with/
        subphrase rela=NA|par
            head
        /or/
        /without/
        subphrase
            head
        /-/
        /-/
        
% exclude uses as modifier:
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            head
        /-/
        
% ensure word is not immediately preceded by a construct form
        /without/
        phrase_atom
            word st=c
            <: head
        /-/
'''
AdjP = A.search(AdjP, sets=sets)

for res in AdjP:
    phrase, head = res[0], res[2]
    record_head(phrase, head)
    
heads_status()

  0.87s 1871 results
139120 phrases matched with a head...
114083 phrases remaining...


In [138]:
unaccounted_AdjP = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'AdjP')
len(unaccounted_AdjP)

0

In [139]:
remaining_types.remove('AdjP')
print(remaining_types)

['PP', 'NP', 'PrNP', 'AdvP']


### AdvP

The adverb phrase has similar internal relations to AdjP. Thus, we apply the same basic template search.

By modifying the `pdp=` parameter, I have found 2 examples of a preposition in the AdvP, which is caused by a prepositional phrase_atom coordinated with the `AdvP` phrase atom. These mixed cases must be dealt with imperfectly by taking the preposition head literally. It is then up to the user of the heads feature to include/exclude cases such as these, or to depend on the phrase_atoms.

There are two cases where an `inrg` serves as a head element. These are incorrect encodings, as they belong under their own phrase type of `InrP`. These should be fixed in BHSA2. For now the `inrg` is excluded as a phrase head, since in each case it is followed by an `advb`, which is probably what triggers these phrases' classification.

There is one case in sentence 68 from Exodus 8:20 where a כל quantifier is incorrectly identified as a phrase head. This is because it precedes a prepositional element. That case is also excluded below.

In [140]:
AdvP = '''

phrase typ=AdvP
    iphrase_atom
        head:iword pdp=advb|subs|nmpr|prep
        
% require either NA subphrase relation
% or no subphrase embedding:
        /with/
        subphrase rela=NA|par
            head
        /or/
        /without/
        subphrase
            head
        /-/
        /-/
        
% exclude uses as modifier:
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            head
        /-/
        
% ensure word is not immediately preceded by a construct form
        /without/
        phrase_atom
            word st=c
            <: head
        /-/
        
% ensure word is not immediately preceded by a prepositional form
        /without/
        phrase_atom
            word pdp=prep
            <: head
        /-/
'''
AdvP = A.search(AdvP, sets=sets)

for res in AdvP:
    phrase, head = res[0], res[2]
    record_head(phrase, head)
    
heads_status()

  1.04s 5777 results
144780 phrases matched with a head...
108423 phrases remaining...


In [141]:
unaccounted_AdvP = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'AdvP')
len(unaccounted_AdvP)

0

In [142]:
remaining_types.remove('AdvP')
print(remaining_types)

['PP', 'NP', 'PrNP']


### PP

The same method used above applies to prepositional phrases.

In [143]:
PP = '''

phrase typ=PP
    iphrase_atom
        head:iword pdp=prep
        
% require either NA subphrase relation
% or no subphrase embedding:
        /with/
        subphrase rela=NA|par
            head
        /or/
        /without/
        subphrase
            head
        /-/
        /or/
        goodsp
            head
        /-/
        
% exclude uses as modifier:
        /with/
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            head
        /-/
        /or/
        goodsp
            head
        /-/
        
% ensure word is not immediately preceded by a construct form
        /without/
        phrase_atom
            word st=c
            <: head
        /-/
        
% ensure word is not immediately preceded by a preposition
        /without/
        phrase_atom
            word pdp=prep
            <: head
        /-/
'''
PP = A.search(PP, sets=sets)

for res in PP:
    phrase, head = res[0], res[2]
    record_head(phrase, head)
    
heads_status()

  0.97s 61509 results
202263 phrases matched with a head...
50940 phrases remaining...


In [144]:
unaccounted_PP = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'PP')
len(unaccounted_PP)

0

In [145]:
remaining_types.remove('PP')
print(remaining_types)

['NP', 'PrNP']


## Complex Heads

In contrast to the preceding phrase types, the noun phrase is much more complicated for head selection due to the presence of quantifiers. The search templates are thus quite lengthy. Each one has been rigorously tested, and each change has been run against a previous version of the template to ensure that any edits did not accidentally shorten or expand the search results beyond the desired effect.
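
One way such a comparison between template versions might be done is a set difference over the result tuples. The helper below is a hypothetical sketch with made-up results; the procedure actually used for testing is not shown in this notebook.

```python
def compare_results(old, new):
    """Report result tuples gained and lost between two template versions."""
    old, new = set(old), set(new)
    return {'added': new - old, 'removed': old - new}

# Hypothetical (phrase, head) results from two versions of a template.
old_results = [(1, 10), (2, 20), (3, 30)]
new_results = [(1, 10), (2, 21), (3, 30)]

diff = compare_results(old_results, new_results)
print(diff)
```

An empty `added` and `removed` would indicate that an edit left the results unchanged.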

### NP and PrNP


Note that some noun phrases contain other phrase types, such as `PP` or even `AdjP` that are not indicated in the present implementation of the data. BHSA2 should seek to remedy this by spinning new phrases with their own types for these.

In [146]:
NP_heads = dict(
    
NP_noqant = f'''

phrase typ=NP|PrNP|DPrP|PPrP
    iphrase_atom
        head:nonquant pdp=subs|adjv|nmpr|prde|prps

% require either NA subphrase relation
% or no subphrase embedding:
        /with/
        subphrase rela=NA|par
            head
        /or/
        /without/
        subphrase
            head
        /-/
        /-/
        /with/
        = iword
        /-/
        
% exclude uses as modifier:
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            head
        /-/
        
% ensure word is not immediately preceded by a construct form
        /without/
        phrase_atom
            word st=c
            <: head
        /-/
        
% ensure word is not immediately preceded by a verb (participle) + preposition
        /without/
        phrase_atom
            word sp=verb
            <: prep
            <: head
        /-/
        /without/
        phrase_atom
            word sp=verb
            <: prep
            <: word pdp=art
            <: head
        /-/
''',

NP_quant_alone = f'''

phrase typ=NP|PrNP|DPrP|PPrP
    iphrase_atom
        quantifier:quant

% quantifier does not precede a quantified element within a subphrase
        /without/
        subphrase
            quantifier
            < w1:nonquantprep pdp=adjv|subs|nmpr|prde|prps
            /with/
            = nonpostprep
            /-/
        /-/ 

% quantifier not immediately adjacent to quantified element within a phrase_atom
        /without/
        phrase_atom
            quantifier
            <: w1:nonquantprep pdp=subs|nmpr|prde|prps
        /-/
        /without/
        phrase_atom
            quantifier
            <: word pdp=art
            <: w1:nonquantprep pdp=subs|nmpr|prde|prps
        /-/
        /without/
        phrase_atom
            w1:nonquantprep pdp=subs|nmpr|prde|prps
            <: quantifier
        /-/
        /without/
        phrase_atom
            w1:nonquantprep pdp=subs|nmpr|prde|prps
            <: word pdp=art
            <: quantifier
        /-/
        
        
% quantifier is not construct with quantified element
        /without/
        quantifier
        <mother- subphrase rela=rec
            nonquantprep pdp=subs|nmpr|prde|prps
        /-/
        /without/
        phrase_atom
            quantifier st=c
            <: nonquantprep pdp=subs|nmpr|prde|prps
        /-/
    
% quantifier is not in another relation with a quantified element
        /without/
        s1:subphrase
            quantifier
        s2:subphrase rela=adj|atr|dem
            w1:nonquantprep pdp=subs|nmpr|prde|prps
            /with/
            = nonpostprep
            /-/
% exclude cases where a prepositional object occurs non-adjacently
            /without/
            subphrase
            /without/
                quant
            /-/
                =: prep
                w1
            /-/
        s1 <mother- s2
        /-/

% ensure quantifier is not in a quantifying chain
% there are numerous possible relations
        /without/
        phrase
            phrase_atom rela=NA|Para
            /with/
                s1:subphrase
                    nonquantprep pdp=subs|nmpr|prde|prps
                s2:subphrase rela=adj|atr
                    word ls=card
                s1 <mother- s2
            /or/
                s1:subphrase
                    word ls=card
                s2:subphrase rela=adj|atr
                    nonquantprep pdp=subs|nmpr|prde|prps
                    /with/
                    = nonpostprep
                    /-/
                s1 <mother- s2
            /or/
                word ls=card
                <mother- subphrase rela=rec
                    nonquantprep pdp=subs|nmpr|prde|prps
            /or/
                nonquantprep pdp=subs|nmpr|prde|prps
                <mother- subphrase rela=rec
                    word ls=card
            /or/
                nonquantprep pdp=subs|nmpr|prde|prps st=c
                <: word ls=card
            /-/
% quantifier is either a cardinal number or BN/ in chain
            w1:word
            /with/
            ls=card prs=absent
            /or/
            lex=BN/ prs=absent
            /-/
            
            quantifier = w1
        /-/

% exclude uses as modifier:
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            quantifier
        /-/
        /with/
        = iword
        /-/
''',)

NP_complex = dict(NP_quantified = '''

phrase typ=NP|PrNP|DPrP|PPrP
    iphrase_atom
   
% ensure that word is quantified with a head-word quantifier
% NB: what follows is a long chain of specs on quantifier

        quantifier:quant

% quantifier not used in rec relations to non-prepositions
        /without/
        nonprep
        <mother- subphrase rela=rec
            quantifier
            w1:word
            /without/
            phrase_atom
                prep
                <: w1
            /-/
            /without/
            phrase_atom
                prep
                <: word pdp=art
                <: w1
            /-/
            w1 = quantifier
        /-/

% quantifier not used in adj relations to non-quantifiers
        /without/
        subphrase
        /with/
            nonquant pdp#conj|art
        /-/
        <mother- subphrase rela=adj
            quantifier
        /-/

% ------------------------------
% NB: what follows is a long chain of specs on head

% require adjacency to quantifier
        <1: subphrase
            head:nonquant pdp=subs|adjv|advb|nmpr|prde|prps
    
% quantified word is not a dependent modifier
% exclude non-quant construct state
            /without/
            nonquant st=c
            <: head
            /-/
            /without/
            nonquant st=c
            <: word pdp=art
            <: head
            /-/

% exclude non-quant rec relas
            /without/
            nonquantprep
            <mother- subphrase rela=rec
                head
            /-/
    
% exclude non-quant para rec relas
            /without/
            nonquantprep
            <mother- subphrase rela=rec
            <mother- subphrase rela=par
                head
            /-/
        
% exclude non-quant adjunct relas
            /without/
            subphrase
            /without/
                := quant
            /-/
            <mother- subphrase rela=adj
                head 
            /-/
    
% exclude non-quant para adjunct relas
            /without/
            subphrase
            /without/
                := quant
            /-/
            <mother- subphrase rela=adj
            <mother- subphrase rela=par
                head
            /-/

% exclude demonstrative relas when demonstrative points to subphrase with words other than quantifiers
            /without/
            subphrase
            /with/
                nonquant pdp#art|conj
            /-/
            <mother- subphrase rela=dem
                head 
            /-/

% exclude all other kinds of relations
            /without/
            subphrase rela=atr|mod
                head
            /-/
            /with/
            = iword
            /or/
            quant
            <: head
            /or/
            quant
            <: word pdp=art
            <: head
            /or/
            head
            <: quant
            /-/
            
% exclude words with immediately preceding prepositions
            /without/
            prep
            <: head
            /-/
            /without/
            prep
            <: word pdp=art
            <: head
            /-/
''',)

query_heads(NP_heads, headi=2, sets=sets)
query_heads(NP_complex, headi=4, sets=sets)
print('\n', '<>'*20, '\n')
heads_status()

running query on NP_noqant
	56963 results found
running query on NP_quant_alone
	1813 results found
running query on NP_quantified
	5709 results found

 <><><><><><><><><><><><><><><><><><><><> 

253200 phrases matched with a head...
3 phrases remaining...


# `obj_prep`

The feature `prep_obj` in `v.1` was an edge feature from a word to its governing preposition. As is done with the nouns above, this would be a nominal element that is disambiguated from its quantifiers. Since there is no dependency of a prepositional object, the nominal templates developed above can be used with the single change that the phrase type is a `PP` or `CP` (which also has prepositional objects!).

Since `v.2` will encode edges from words to phrases rather than the other way around, this feature will encode an edge from the object to the preposition, hence the new feature name.
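
The edge direction can be pictured with two mappings, one from object to preposition and one back. The sketch below uses made-up slot numbers and a hypothetical helper `nearest_prep`, which walks backwards from the object until it reaches a preposition. The lower bound on the walk is an added guard for this toy example.

```python
import collections

preps = {3, 7}       # hypothetical preposition slots
objects = [5, 8]     # hypothetical object slots returned by a query

def nearest_prep(obj, preps, first_slot=1):
    """Back up one slot at a time until a preposition is found."""
    cur_slot = obj - 1
    while cur_slot >= first_slot:
        if cur_slot in preps:
            return cur_slot
        cur_slot -= 1
    return None      # no preposition before this slot

obj2prep = {}                                # object -> governing preposition
prep2obj = collections.defaultdict(set)      # preposition -> its objects

for obj in objects:
    prep = nearest_prep(obj, preps)
    if prep is not None:
        obj2prep[obj] = prep
        prep2obj[prep].add(obj)

print(obj2prep)
```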

<hr>

In [147]:
pp_obj_queries = {}
    
PP_noqant = f'''

phrase_atom
    prep prs=absent
    < head:nonquant pdp#conj|art|prep|nega

% either word is adjacent to prep
    /with/
    phrase_atom
        prep
        <: head
        
% or word is adjacent to prep but interrupted by article
    /or/
    phrase_atom
        prep
        <: word pdp=art
        <: head
    
    /or/

% or word is w1, an independent, non-modifying word
% what follows is a long description for that situation

    w1:word

% exclude w1 uses as modifier
    /with/
    /without/
    subphrase rela=adj|atr|mod|dem
        w1
    /-/
    /or/
    goodsp
        w1
    /-/
    /with/
    = iword
    /-/

% exclude w1 rec relations to non-prepositions
    /without/
    nonprep
    <mother- subphrase rela=rec
        w1
    /-/

% ensure w1 is not immediately preceded by a construct form
    /without/
    phrase_atom
        nonprep st=c
        <: w1
    /-/
    

% exclude cases where word occurs in a subphrase immediately before a preposition
% only 1 case of this, but may be other edge cases this misses.
    /without/
    s1:subphrase
        prep
    s2:subphrase rela=par
        w1
        <: prep
    s1 <mother- s2
    /-/
    
    w1 = head
    /-/

'''

pp_obj_queries['PP_noqant'] = {'template': PP_noqant,
                               'prepi': 1,
                               'obji': 2}

PP_quant_alone = f'''

phrase_atom
    prep prs=absent
    < quantifier:quant

% quantifier does not precede a quantified element within a subphrase
    /without/
    subphrase
        quantifier
        < w1:nonquantprep pdp=subs|adjv|advb|nmpr|prde|prps
    
        /without/
        = postprep
        /-/
        
        /without/
        subphrase
        /without/
            quant
        /-/
            prep
            < w1
        /-/
    /-/ 
    

% quantifier not immediately adjacent to quantified element within a phrase_atom
    /without/
    phrase_atom
        quantifier
        <: w1:nonquantprep pdp=subs|nmpr|prde|prps
    /-/
    /without/
    phrase_atom
        quantifier
        <: word pdp=art
        <: w1:nonquantprep pdp=subs|nmpr|prde|prps
    /-/
    /without/
    phrase_atom
        w1:nonquantprep pdp=subs|nmpr|prde|prps
        <: quantifier
    /-/
    /without/
    phrase_atom
        w1:nonquantprep pdp=subs|nmpr|prde|prps
        <: word pdp=art
        <: quantifier
    /-/
        
% quantifier is not construct with quantified element
    /without/
    quantifier
    <mother- subphrase rela=rec
        nonquantprep pdp=subs|adjv|advb|nmpr|prde|prps
    /-/
    /without/
    phrase_atom
        quantifier st=c
        <: nonquantprep pdp=subs|adjv|advb|nmpr|prde|prps
    /-/
    
% quantifier is not in another relation with a quantified element
    /without/
    s1:subphrase
        quantifier
    s2:subphrase rela=adj|atr|dem
        w1:nonquantprep pdp=subs|adjv|advb|nmpr|prde|prps 
        /without/
        = postprep
        /-/
        /without/
        subphrase
        /without/
            quant
        /-/
            < prep
            w1
        /-/
        
    s1 <mother- s2
    /-/

% ensure quantifier is not in a quantifying chain
    /without/
    phrase_atom
    /with/
        s1:subphrase
            nonprep pdp=subs|adjv|nmpr|prde|prps lex#{quantlexs} ls#card
        s2:subphrase rela=adj|atr
            word ls=card
        s1 <mother- s2
    /or/
        s1:subphrase
            word ls=card
        s2:subphrase rela=adj|atr
            w1:nonprep pdp=subs|adjv|nmpr|prde|prps lex#{quantlexs} ls#card
            /without/
            = postprep
            /-/
            /without/
            subphrase
            /without/
                quant
            /-/
                < prep
                w1
            /-/
        s1 <mother- s2
    /-/
        quantifier ls=card prs=absent
    /-/

% exclude uses as modifier:
    /without/
    subphrase rela=adj|atr|rec|mod|dem
        quantifier
        w1:word
        /without/
        = postprep
        /-/
        quantifier = w1
    /-/
    /with/
    = iword
    /-/
'''

pp_obj_queries['PP_quant_alone'] = {'template': PP_quant_alone,
                                    'prepi': 1,
                                    'obji': 2}

PP_quantified = f'''


phrase_atom
    prep prs=absent
   
% ensure that word is quantified with a head-word quantifier
% NB: what follows is a long chain of specs on quantifier

    < quantifier:quant

    /with/
    phrase_atom
        prep
        <: quantifier
    /or/
    
% quantifier not used in rec relations to non-prepositions
    /without/
    nonprep
    <mother- subphrase rela=rec
        quantifier
        w1:word
        /without/
        phrase_atom
            prep
            <: w1
        /-/
        /without/
        phrase_atom
            prep
            <: word pdp=art
            <: w1
        /-/
        w1 = quantifier
    /-/

% quantifier not used in adj relations to non-quantifiers
    /without/
    subphrase
    /with/
        nonquant pdp#conj|art
    /-/
    <mother- subphrase rela=adj
        quantifier
        w1:word
        /without/
        prep
        <: w1
        /-/
        w1 = quantifier
    /-/
    /-/

% ------------------------------
% NB: what follows is a long chain of specs on head

% require adjacency to quantifier
    <1: subphrase
        head:nonquant pdp=subs|adjv|advb|nmpr|prde|prps
        
        /with/
        phrase_atom
            prep
            <: quant
            <: head

        /or/
        phrase_atom
            prep
            <: quant
            <: word pdp=art
            <: head
        
        /or/
    
% quantified word is not a dependent modifier
% exclude construct state to non quants/preps
        /without/
        nonquantprep st=c
        <: head
        /-/
        /without/
        nonquantprep st=c
        <: word pdp=art
        <: head
        /-/

% iword requirements
        /with/
        = iword
        /or/
        quant
        <: head
        /or/
        quant
        <: word pdp=art
        <: head
        /or/
        head
        <: quant
        /-/

% exclude non-quant/prep rec relas
        /without/
        nonquantprep
        <mother- subphrase rela=rec
            head
        /-/
    
% exclude non-quant para rec relas
        /without/
        nonquantprep
        <mother- subphrase rela=rec
        <mother- subphrase rela=par
            head
        /-/
        
% exclude non-quant adjunct relas
        /without/
        subphrase
        /without/
            := quant
        /-/
        <mother- subphrase rela=adj
            head
        /-/
    
% exclude non-quant para adjunct relas
        /without/
        subphrase
        /without/
            := quant
        /-/
        <mother- subphrase rela=adj
        <mother- subphrase rela=par
            head
        /-/

% exclude demonstrative relas when demonstrative points to subphrase with words other than quantifiers
        /without/
        subphrase
        /with/
            nonquant pdp#art|conj
        /-/
        <mother- subphrase rela=dem
            head 
        /-/

% exclude all other kinds of relations
        /without/
        subphrase rela=atr|mod
            head
        /-/
        /-/
'''

pp_obj_queries['PP_quantified'] = {'template': PP_quantified,
                                   'prepi': 1,
                                   'obji': 4}
    
special_quantified = '''

% necessary due to technical limitation in search patterns
phrase_atom
    prep
    <: quant
    <: nonquant pdp=subs|adjv|advb|nmpr|prde|prps
'''
pp_obj_queries['special_quantified'] = {'template': special_quantified,
                                        'prepi': 1,
                                        'obji': 3}
    
PP_to_PP = '''

phrase_atom
    prep
    <: prep
'''

pp_obj_queries['PP_to_PP'] = {'template': PP_to_PP,
                              'prepi': 1,
                              'obji': 2}
    
PP_to_conj = '''

phrase_atom
    prep
    <: w:word pdp=conj
    
    /with/
    phrase_atom typ=CP
        w
    /or/
    lex=C|>CR
    /-/
'''

pp_obj_queries['PP_to_conj'] = {'template': PP_to_conj,
                                'prepi': 1,
                                'obji': 2}

PP_negation = '''

pa:phrase_atom
    pp:prep
    neg:word pdp=nega

pa =: pp
pa := neg
pp # neg
'''
pp_obj_queries['PP_negation'] = {'template': PP_negation,
                                'prepi': 1,
                                'obji': 2}



obj2prep = {} # plain dict; a defaultdict without a factory behaves the same way
prep2obj = collections.defaultdict(set)

for name, query in pp_obj_queries.items():
    template = query['template']
    prepi = query['prepi']
    obji = query['obji']
    
    print(f'running query on {name}...')
    results = A.search(template, sets=sets)

    print('\tprocessing prepositions...')
    for res in results:
        obj = res[obji]
        # back up one slot until a preposition is found
        prep = None
        cur_slot = obj
        while not prep:
            cur_slot -= 1
            if cur_slot in preps:
                prep = cur_slot
                
        obj2prep[obj] = prep
        prep2obj[prep].add(obj)
        
print('\n', '<>'*20, '\n')
print(f'queries complete with {len(obj2prep)} object of preposition mappings...')

running query on PP_noqant...
  1.00s 65920 results
	processing prepositions...
running query on PP_quant_alone...
  0.10s 910 results
	processing prepositions...
running query on PP_quantified...
  0.49s 5409 results
	processing prepositions...
running query on special_quantified...
  0.28s 2038 results
	processing prepositions...
running query on PP_to_PP...
  0.10s 2944 results
	processing prepositions...
running query on PP_to_conj...
  0.25s 1055 results
	processing prepositions...
running query on PP_negation...
  0.28s 1 result
	processing prepositions...

 <><><><><><><><><><><><><><><><><><><><> 

queries complete with 64260 object of preposition mappings...
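
The "back up one slot until a preposition is found" step in the loop above can be illustrated on toy data. This is only a sketch: `nearest_prep`, `first_slot`, and the toy slot numbers are hypothetical names and values, not part of the notebook. Unlike the loop above, it stops at the first slot and returns `None` when no preposition precedes the object, so it cannot run forever on unexpected input.

```python
def nearest_prep(obj, preps, first_slot=1):
    """Walk backwards from slot `obj`, one slot at a time, and return
    the first slot that is in `preps`. Return None if no preposition
    is found before reaching `first_slot` (hypothetical lower bound)."""
    cur_slot = obj - 1
    while cur_slot >= first_slot:
        if cur_slot in preps:
            return cur_slot
        cur_slot -= 1
    return None

# toy data: slots 3 and 10 are prepositions
toy_preps = {3, 10}

found = nearest_prep(5, toy_preps)      # slot 4 is skipped, slot 3 is a preposition
not_found = nearest_prep(2, toy_preps)  # no preposition precedes slot 2
print(found, not_found)
```

Returning `None` leaves it to the caller to decide whether a missing preposition should be skipped or reported.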


## Check for Missing Prepositional Objects


In [148]:
# for testing: replace 0 with the phrase node to inspect
for sp in L.d(0, 'subphrase'):
    print(sp, F.rela.v(sp), E.mother.f(sp), T.text(sp))
    print(L.d(sp, 'word'))
    print()

In [149]:
simple_check_results = A.search('''

phrase_atom
    prep prs=absent
    <: word pdp#conj
    
''', sets={'prep': preps})

simple_check_prep = [res for res in simple_check_results if res[1] not in prep2obj]

print(f'{len(simple_check_prep)} prepositions missing...')

A.show(simple_check_prep, withNodes=True, condenseType='phrase_atom', end=100)

  0.68s 62468 results
0 prepositions missing...
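
The check above amounts to a membership test against the keys of `prep2obj`. A minimal sketch on toy data follows; `missing_preps` and the toy node numbers are hypothetical. It also shows that testing membership with `in` does not add keys to a `defaultdict`, so the check does not alter the mapping.

```python
import collections

def missing_preps(candidates, prep2obj):
    """Return the candidate prepositions that have no recorded object."""
    return [p for p in candidates if p not in prep2obj]

# toy mapping: prepositions 3 and 10 have objects, 7 does not
toy_prep2obj = collections.defaultdict(set)
toy_prep2obj[3].add(4)
toy_prep2obj[10].add(12)

missing = missing_preps([3, 7, 10], toy_prep2obj)
print(f'{len(missing)} prepositions missing...', missing)
```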


# `nheads`

In many cases one does not want to go through prepositions to reach the nominal head elements of a phrase (e.g. an independent substantive or adjective). For this we export an additional feature, `nheads` ("nominal heads"), which skips any prepositions and selects the nominal elements they govern in the phrase and phrase atoms. The feature is built from the `phrase2heads` and `prep2obj` dictionaries calculated above. A preposition with no recorded object is kept as its own `nhead`.

### Note on AdjP
This feature does not select nominals that are embedded within an adjective phrase (`AdjP`), but those can be selected with the following pattern:

In [150]:
adj_nhead = '''

phrase_atom typ=AdjP
    w1:word 
    /with/
    word pdp=adjv
    <mother- subphrase rela=rec
        w1
    /-/

'''

A.show(A.search(adj_nhead), condenseType='phrase_atom', end=5)

  0.52s 201 results


In [151]:
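def find_prep_nominal(preposition, nominals=None):
    '''
    This function recursively
    moves through prepositional
    chains to obtain the ultimate 
    governed nominal element.
    
    Results are appended to `nominals`,
    which is also returned.
    '''
    # a list default would be shared across calls, so create it here
    if nominals is None:
        nominals = []
    objects = prep2obj.get(preposition, None)
    if objects:
        for obj in objects:
            if obj not in sets['prep']:
                nominals.append(obj)
            else:
                # object is itself a preposition: follow the chain
                find_prep_nominal(obj, nominals=nominals)
    return nominals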
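
To make the chain-following logic concrete, here is a small self-contained sketch on toy data. The names `resolve_nominals`, `toy_preps`, and `toy_prep2obj` are hypothetical, and the `seen` guard against circular chains is an added assumption, not something the notebook does. The variant returns a fresh list instead of filling a list passed in by the caller.

```python
def resolve_nominals(prep, prep2obj, preps, seen=None):
    """Follow preposition -> object links until non-prepositions are
    reached and return them as a list. `seen` guards against cycles."""
    seen = set() if seen is None else seen
    if prep in seen:
        return []
    seen.add(prep)
    found = []
    for obj in sorted(prep2obj.get(prep, ())):
        if obj in preps:
            # the object is itself a preposition: keep following the chain
            found.extend(resolve_nominals(obj, prep2obj, preps, seen))
        else:
            found.append(obj)
    return found

# toy chain: preposition 1 governs preposition 2, which governs words 3 and 5;
# preposition 7 has no recorded object
toy_preps = {1, 2, 7}
toy_prep2obj = {1: {2}, 2: {3, 5}, 7: set()}

chain_result = resolve_nominals(1, toy_prep2obj, toy_preps)
empty_result = resolve_nominals(7, toy_prep2obj, toy_preps)
print(chain_result, empty_result)
```

An empty result corresponds to the case where the preposition itself is kept as the `nhead`.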

In [152]:
nheads = collections.defaultdict(set)

for phrase, heads in phrase2heads.items():
    for head in heads:
        if head not in sets['prep']:
            nheads[phrase].add(head)
        else:
            nominals = []
            find_prep_nominal(head, nominals=nominals)
            if nominals:
                nheads[phrase] |= set(nominals)
            else:
                nheads[phrase].add(head) # added 22.03.19: keep nhead feature for preps without objects
            
print(f'{len(nheads)} nheads assigned...')
print(f'{len(phrase2heads)-len(nheads)} phrases not assigned an nhead...')

253200 nheads assigned...
0 phrases not assigned an nhead...


In [153]:
examples = [(phrase,)+tuple(heads) for phrase, heads in nheads.items()
               #if F.typ.v(phrase) == 'PP'
                if len(heads) > 1
           ]

random.shuffle(examples)

In [154]:
for res in examples[:10]:
    A.prettyTuple(res, condenseType='phrase', withNodes=True, seq=res[0])

# Export TF Data

In this section the dictionaries built up in this notebook are converted to TF data files.


In [155]:
# invert the phrase -> heads mappings (phrase2heads, nheads) into head -> phrase edge dicts
head = {head:{phrase} for phrase, heads in phrase2heads.items() for head in heads}
nhead = {head:{phrase} for phrase, heads in nheads.items() for head in heads}
obj_prep = {obj:{prep} for obj, prep in obj2prep.items()}

# sem_set feature: labels each word in the prep and quant sets
sem_set = {node:feature for feature, fset in {'prep': preps, 'quant':quantifiers}.items() for node in fset}
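
The dict comprehensions above keep exactly one target per source node: if the same word were listed under more than one phrase node, the later entry would overwrite the earlier one. Whether that case occurs in this data is not established here. Under that assumption, a sketch of an inversion that accumulates all targets could look like this; `invert_to_edges` and the toy node numbers are hypothetical.

```python
import collections

def invert_to_edges(group2members):
    """Invert a group -> members mapping into member -> {groups},
    keeping every group a member belongs to."""
    edges = collections.defaultdict(set)
    for group, members in group2members.items():
        for member in members:
            edges[member].add(group)
    return dict(edges)

# toy data: word 2 is listed under both node 100 and node 101
toy_phrase2heads = {100: {1, 2}, 101: {2}}

toy_edges = invert_to_edges(toy_phrase2heads)
print(toy_edges)
```

When every word appears under a single phrase node, this produces the same dictionary as the comprehension.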

In [156]:
# put features in edge/node dicts for TF.save
edge_features = {'head': head,
                 'obj_prep': obj_prep,
                 'nhead': nhead}
node_features = {'sem_set': sem_set}

# metadata needed to write the features
meta = {
    
'': {'created_by': 'Cody Kingham',
     'coreData': 'BHSA',
     'coreVersion': version,
     'source': 'see the creation notebooks at https://github.com/etcbc/heads'},
        
'head' : {'valueType': 'int',
          'edgeValues': False},
        
'obj_prep': {'valueType': 'int',
            'edgeValues': False},
        
'nhead': {'valueType': 'int',
          'edgeValues': False},
        
'sem_set':{'valueType':'str'},
    
}

TF = Fabric(locations='~/github/etcbc/heads/tf', modules=version, silent=True)
TF_api = TF.load('', silent=True)
TF.save(nodeFeatures=node_features, edgeFeatures=edge_features, metaData=meta)
print(f'BHSA {version} EXPORT COMPLETE!')

  0.00s Exporting 1 node and 3 edge and 0 config features to ~/github/etcbc/heads/tf/2021:
   |     0.05s T sem_set              to ~/github/etcbc/heads/tf/2021
   |     0.29s T head                 to ~/github/etcbc/heads/tf/2021
   |     0.32s T nhead                to ~/github/etcbc/heads/tf/2021
   |     0.08s T obj_prep             to ~/github/etcbc/heads/tf/2021
  0.74s Exported 1 node features and 3 edge features and 0 config features to ~/github/etcbc/heads/tf/2021
BHSA 2021 EXPORT COMPLETE!
