# Going Subphraseless

The current method for isolating phrase heads ([here](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb)) requires strenuous and ineloquent processing of BHSA subphrase relations. The subphrases are not always consistently encoded and suffer from numerous exceptional cases. The result is that the method is rather convoluted and ineloquent.

This notebook will explore the possibility of disconnecting semantic head analysis from the ETCBC subphrase encoding. 

A "semantic" head is the primary content word of a phrase, following Croft's "Primary Information Bearing Unit":

> **The noun and the verb are the PRIMARY INFORMATION_BEARING UNITS (PIBUs) of the phrase and clause respectively. In common parlance, they are the content words. PIBUs have major informational content that functional elements such as articles and [auxiliaries] do not have. (Croft, *Radical Construction Grammar*, 2001, 258; see also Shead, *Radical Frame Semantics and Biblical Hebrew*, 104)**

> **A (semantic) head is the profile equivalent that is the primary information-bearing unit, that is, the most contentful item that most closely profiles the same kind of thing that the whole constituent profiles. (ibid., 259)**

Croft also provides an additional criterion to "profile equivalence":

> **If the criterion of profile equivalence produces two candidates for headhood, the less schematic meaning is the PIBU; that is, the PIBU is the one with the narrower extension, in the formal semantic sense of that term (ibid., 259)**

## Inquiry

Can we isolate semantic phrase heads in BHSA using only the phrase_atom and phrase limits? This question indeed means that we  take the phrase_atom/phrase boundaries for granted. Empirically, the validity of BHSA phrase boundaries needs to be tested. But for now, the exercise of isolating semantic phrase heads could be seen as the first step towards reproducible phrase boundaries.

## Basic Concepts

A semantic head will most often stand in a syntactically independent position. For Hebrew nominal phrases, that essentially means a word which is not precided by a construct, which is not in an attributive slot (e.g. H + noun + H + ATTRIBUTIVE), and which is not in an adjectival slot (e.g. noun + noun as in אישׁ טוב).

The situation is slightly complicated by quantifier expressions, which may be syntactically independent but semantically secondary. These are expressed through specialized lexical items such as cardinal numbers and qualitative quantifiers (e.g.  "כל" and "חצי").

Another complication is the use of nouns as prepositional items. Such uses can be seen with words like פני "face" such as לפני "in front," and even words like ראשׁ as in ראשׁ החדשׁ "beginning of the month." 

Other expressions of quantity, quality, and function provide similar complexities. These cases have to be specified in advance.

## Prerequisites

As discussed above, lexical-semantic information is crucial in separating instances of quantification. These sets have to be defined in advance. As already noted, the BHSA phrase_atom/phrase boundaries are also taken for granted.

## The Noun Phrase

The focus of this initial inquiry is the noun phrase. All of the most complicated problems can be found in this group. Solving the problems in NP classification will likewise allow PP classification to easily follow. In cognitive linguistics, nouns are considered the semantic heads of prepositional phrases. That same definition will be adopted herein.

### A Process of Elimination

This exploratory analysis will proceed via a process of elimination, advancing from simple cases towards the complex ones.

**Let's get started**. We load the necessary functions and BHSA data (straight from source).

In [4]:
import collections
import random
import fasttext
from scipy.spatial.distance import cosine
from tf.app import use
A = use('bhsa', hoist=globals(), silent=True)

def rem_accent(word):
    # Remove accents from words
    return ''.join(c for c in word if c in good_chars)

   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


In [5]:
model = fasttext.load_model('/Users/cody/github/codykingham/bhsa_vectors/model.bin') # get semantic data




As a first step, we need to define what word types will become elligible candidates for noun heads. Parts of speech are frequently construed into various roles. As a result, we must be prepared to accept a variety of potential candidates.

Instead of the BHSA `pdp` ("phrase dependent part of speech") feature, we will rely on the `sp` ("part of speech") feature, which is derived directly from the lexicon.

Let's have a look at the potential `sp` values in the dataset.

In [7]:
F.sp.freqList()

(('subs', 125581),
 ('verb', 75450),
 ('prep', 73298),
 ('conj', 62737),
 ('nmpr', 35696),
 ('art', 30387),
 ('adjv', 10052),
 ('nega', 6059),
 ('prps', 5035),
 ('advb', 4603),
 ('prde', 2678),
 ('intj', 1912),
 ('inrg', 1303),
 ('prin', 1026))

See the short descriptions for the values [here](https://etcbc.github.io/bhsa/features/sp/). 

The following parts of speech could be expected to be construed as NP sem. heads: substantives (`subs`), verbs (i.e. as participles), proper nouns (`nmpr`), adjectives (`adjv`), personal pronouns (`prps`, most often found exclusively in phrases marked a "Personal Pronoun Phrases", but in cases of noun coordination BHSA will not distinguish!), demonstrative pronouns (`prde`). 

It is unknown whether adverbs (`advb`) or interrogative particles (`inrg`) might also function as semantic heads in BHSA phrases. Let's do a quick check. We look for cases where a word with a `sp` value of `advb` becomes a `subs` in its `pdp` feature.

In [8]:
find_advb = A.search('word sp=advb pdp=subs')

  0.53s 0 results


We don't find any cases. How about the interrogative particle inside phrases marked as `NP`?

In [9]:
find_intr = A.search('''

clause
    phrase typ=NP
        word sp=inrg

''')

  0.75s 1 result


In [10]:
A.table(find_intr, condenseType='clause')

n,p,clause,phrase,word
1,1_Chronicles 17:6,בְּכֹ֥ל הֲדָבָ֣ר דִּבַּ֗רְתִּי אֶת־אַחַד֙ שֹׁפְטֵ֣י יִשְׂרָאֵ֔ל,הֲדָבָ֣ר,הֲ


The interrogative does not function as a semantic head here.

Ok. So this leaves us with the other potential part of speech values. Are these iron-clad? Let's do this: we can be even more sure we have the right set by making a broader query: for any word with a `pdp=subs` wherein the `sp != subs`.

We expect to see only: `subs`, `nmpr`, and `adjv`. The values `prps` and `prde` are not likely to change in `pdp`.

In [11]:
set(F.sp.v(w) for w in F.otype.s('word')
        if F.pdp.v(w) == 'subs' and F.sp.v(w) != 'subs')

{'adjv', 'verb'}

This is *mostly* what we expected. But I did forget that `nmpr` will also *not* change inside the NP. So `adjv` and `verb` is exactly what we want to see here.

I see that I've also included `prde` "demonstratives" in the list here. I now very much doubt whether this is ever the case. I'll make a query here to see whether any demonstratives occur outside of a modifying position. Specifically, we require that the demonstrative is not preceded by an article (excludes the attributive pattern H + noun + H + demonstrative) or a noun, and the demonstrative does not precede a noun. 

In [12]:
find_demo = A.search('''

phrase typ=NP
    word sp#art|subs|adjv
    <: word sp=prde
    <: word sp#subs|adjv
''')

  1.66s 2 results


In [13]:
A.table(find_demo, withNodes=True, extraFeatures='sp')

n,p,phrase,word,word.1,word.2
1,Exodus 26:13,הָאַמָּ֨ה מִזֶּ֜ה וְהָאַמָּ֤ה מִזֶּה֙ 678044,מִ 42881,זֶּ֜ה 42882,וְ 42883
2,2_Chronicles 9:18,יָדֹ֛ות מִזֶּ֥ה וּמִזֶּ֖ה 896933,מִ 411458,זֶּ֥ה 411459,וּ 411460


These cases are slightly complicated by the fact that at a higher level they do govern their own phrase, but at the level of the whole phrase they do not. Is this fact reflected in the BHSA phrase structure?

In [14]:
A.pretty(678044, withNodes=True)

So we see a separate phrase atom encoded here. What kind of phrase type value is coded on that phrase atom?

In [15]:
F.typ.v(L.u(42882, 'phrase_atom')[0])

'PP'

Ok, BHSA simply encodes this as a prepositional phrase. However, our algorithm ought to work also with such prepositional phrases. This presents a slight complicating factor. On the one hand, a demonstrative pronoun might be treated like any other substantive, yet it is functionally different in nearly all cases. For example, many cases would consist of a demonstrative + noun pattern. The algorithm might see: noun_candidateA (זה) + noun_candidateB; it sees that noun_candidateB agrees with noun_candidateA in gender and number; pursuant to the rule on adjectives, i.e. Gesenius §132.1a, the algorithm (incorrectly) classifies noun_candidateA as syntactically autonomous and thus the top candidate for semantic headedness.

What to do?

The idea of a "ranking" system is intriguing. In such a system, cases like this receive a ranking based purely on the syntactic patterning. But we can introduce a lexical processing stage that adjusts ranks based on lexical rules. That also helps us handle complex cases such as quantifiers. 

Other cases may be very complicated. An example can be found in the following English noun phrase:

> a. Tim drank a **cup** of coffee.<br>
> b. \*Tim broke a **cup** of coffee. <br>
> <br>
> (Croft, *Radical Construction Grammar*, 262)

The reason statement b is awkward is because "coffee" functions here as a functional head, despite "cup" being in a syntactically autonomous position. This case can be analyzed as the English  [CONTAINER of] (e.g. "cup of") construction in English, which indicates a measure of a substance. A heads algorithm needs to be able to take into account such idiosyncrasies. That information simply has to be hardwired in.

We can now confidently say that valid NP head candidates should have only the following `sp` values: `subs`, `nmpr`, `adjv`, `verb`, `prps`, with some allowances made for `prde`.

In [16]:
cand_sps = {'subs', 'nmpr', 'verb', 'adjv', 'prps', 'prde'}





**It's time to define a custom set of quantifiers and idiosyncratic heads.** All of these values are taken from the [notebook](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb) that utilizes subphrase relations.

In [142]:
custom_quants = set()
custom_quants |= set(F.ls.s('card')) & set(F.otype.s('word'))
custom_quants |= set(F.ls.s('ordn')) & set(F.otype.s('word'))

quant_lexs = '|'.join(['KL/', 'M<V/', 'JTR/',
                         'M<FR/', 'XYJ/', '<FRWN/',
                         'C>R=/', 'MSPR/', 'RB/', 'RB=/',
                         'XMJCJT/'])
custom_quants |= set(A.search(f'word lex={quant_lexs}', shallow=True, silent=True))

# for the Hebrew idiom: בנ + quantifier for age
for w in set(F.otype.s('word')) & set(F.lex.s('BN/')):
    pos = Positions(w, 'phrase_atom').get
    if all([F.ls.v(pos(1)) == 'card',
            F.st.v(w) == 'c',
            F.nu.v(w) == 'sg']):
        custom_quants.add(w)
        
len(quantifiers)

custom_preps = set()

### Machinery

We could use some machinery to do the hard work of looking in and around a node. In the older approach we used TF search templates. But these are not very efficient at scale, and they are always bound by the limits of the query language. I take another approach here: a class which, when called, can give all sorts of contextual information about any node which is fed in.

In [190]:
class Getter:
    '''
    A class to safely index beyond the limits of
    an iterable with a default returned.
    Like dict.get but for iterables.
    '''
    
    def __init__(self, iterable, default=None):
        self.iterable = iterable
        self.default = default
        
    def __iter__(self):
        for i in self.iterable:
            yield i
        
    def __getitem__(self, key):
        try:
            return self.iterable[key]
        except IndexError:
            return self.default
        
    def index(self, i):
        try:
            return self.iterable.index(i)
        except ValueError:
            return self.default

class Positions:
    '''
    For a given node, provides access to nodes
    that are (+/-)N positions away in terms of 
    node adjacency within a given context.
    The context must be a nodeType that is larger
    than the supplied node.
    '''
    
    def __init__(self, n, context):
        self.n = n
        self.thisotype = F.otype.v(n)
        self.context = self.get_context(context)
    
    def get(self, position, features=None):
        '''
        Get a node (+/-)N positions away, 
        with an option to get values for specified features.
        '''
        positions = Getter(L.d(self.context, self.thisotype))
        index = positions.index(self.n)
        find = sum([index, position])
        
        # return None if beyond boundaries
        if find < 0:
            return None
        
        # return node or requested feature
        toget = positions[find]
        
        # simple node return
        if not features:
            return toget
        
        # give single feature
        if len(features.split())==1:
            return Fs(features).v(toget)
        
        # give pl features
        elif features:
            feats = set()
            for f in features.split():
                feats.add(Fs(f).v(toget))
            return feats
    
    def get_context(self, otype):
        '''
        Returns a requested context node.
        '''
        if otypeRank[self.thisotype] > otypeRank[otype]:
            raise Exception('Provided context is smaller than the provided node!')
        else:
            return Getter(L.u(self.n, otype))[0]

def clear(iterable):
    '''
    Clears iterables of any 
    False or Nonetype objects.
    '''
    return [i for i in iterable if i]
    
class Context:
    '''
    Provides access to a nominal's (non-verb/prep) 
    surrounding grammatical context based on 
    adjacency or its neighborhod. The algorithm 
    assumes that the provided noun has already
    been checked for nominal status (e.g. in the 
    case of verbs and participles). The algorithm 
    assumes no hard boundaries between noun and 
    adjective classes, on the basis that nouns can 
    often serve multiple roles. As such, a nominal can
    govern ("mom") or be governed ("kid") depending
    on its use. Thus, symmetrical data is provided
    in the form of self.moms and self.kids for all
    contexts that are queried, wherein moms are
    governing this word, and kids are governed by
    this word.
    '''
    
    def __init__(self, n):
        p = Positions(n, 'phrase_atom').get
        self.mom = collections.defaultdict(list)
        self.kid = collections.defaultdict(list)
        
        # RETRIEVE RELATIONSHIPS
        
        # ** construct patterns **

        # -- mom
        const_m = clear([
            
            (p(-1) 
                if p(-1,'st')=='c'
                else None),

            (p(-2) 
                if p(-2,'st')=='c'
                and p(-1, 'sp')=='art'
                else None),
        ])
        
        
        # -- kid
        const_k = clear([
            
            (p(1) 
                if p(0,'st')=='c'
                and p(1, 'sp')!='art'
                else None),

            (p(2) 
                if p(0,'st')=='c' 
                and p(1, 'sp')=='art'
                else None),
        ])
        
        self.mom['const'] = Getter(const_m)[0]
        self.kid['const'] = Getter(const_k)[0]
    
        # ** adjective patterns **
        
        # -- mom
        adjv_m = clear([
            
            (p(-1) 
                 if p(0,'sp')=='adjv'
                and p(-1,'nu gn')==p(0,'nu gn')
                and p(-1,'st')=='a'
                else None),

            (p(-1) 
                 if p(-1,'sp vt').issubset({'verb', 'ptcp', 'ptca'})
                and p(-1,'nu gn')==p(0,'nu gn')
                and p(0,'st')=='a'
                else None),

            (p(-2) 
                if p(-1,'sp')=='art'
                and p(-2,'sp') in {'subs','nmpr','adjv'}
                and p(-2,'nu gn')==p(0,'nu gn')
                and p(-2,'st')=='a'
                else None),
        ])
        
        # -- kid
        adjv_k = clear([
            
            (p(1)
                if p(0,'sp')=='adjv'
                and p(1,'nu gn')==p(0,'nu gn')
                and p(0,'st')=='a'
                else None),

            (p(1)
                if p(1,'sp vt').issubset({'verb', 'ptcp', 'ptca'})
                and p(1,'nu gn')==p(0,'nu gn')
                and p(0,'st')=='a'
                else None),

            (p(2)
                if p(1,'sp')=='art'
                and p(2,'sp') in ('subs','nmpr','adjv')
                and p(2,'nu gn')==p(0,'nu gn')
                and p(0,'st')=='a'
                else None),
        ])
        
        self.mom['adjv'] = Getter(adjv_m)[0]
        self.kid['adjv'] = Getter(adjv_k)[0]
        
        # ** preposition patterns (mother only) **
        prep_m = clear([ 
            
            (p(-1)
                if p(-1,'sp')=='prep' or p(-1) in custom_preps
                else None),

            (p(-2)
                if p(-1,'sp')=='art'
                and p(-2,'sp')=='prep' or p(-2) in custom_preps
                else None),
        ])
        
        self.mom['prep'] = Getter(prep_m)[0]
    
        # coordinate patterns
        # NB: before == mom; after == kid
        
        # -- mom
        coord_m = clear([
            (p(-2) if p(-1)=='conj'
                else None),

            (p(-3) if p(-1)=='art'
                and p(-2)=='conj'
                else None),

            (p(-3) if p(-1)=='prep'
                and p(-2)=='conj' else None),

            (p(-4) if p(-1)=='art'
                and p(-2)=='prep' or p(-2) in custom_preps
                and p(-3)=='conj'
                else None),
        ])
        
        # -- kid
        coord_k = clear([
            
            (p(2)
                if p(1)=='conj'
                else None),
            
            (p(3)
                if p(1)=='conj'
                and p(2)=='art'
                else None),
            
            (p(3)
                if p(1)=='conj'
                and p(2)=='prep' or p(2) in custom_preps
                and p(3,'sp')!='art'
                else None),
            
            (p(4)
                if p(1)=='conj'
                and p(2)=='prep' or p(2) in custom_preps
                and p(3, 'sp')=='art'
                else None),
            
        ])
        
        self.mom['coord'] = Getter(coord_m)[0]
        self.kid['coord'] = Getter(coord_k)[0]
    
        # quantifier patterns
        
        # -- mom
        quant_m = clear([
            
            (p(-1)
                if p(0) in custom_quants
                and p(-1, 'sp') in {'subs', 'adjv'}
                else None),
            
            (p(-2)
                if p(-1,'sp')=='art'
                and p(0) in custom_quants
                else None),
            
            (p(1) 
                if p(0) in custom_quants
                and p(1, 'sp') in {'subs', 'adjv'}
                else None),
            
            (p(2) if p(0) in custom_quants
                and p(1, 'sp')=='art'
                else None),
        
        ])
        
        # -- kid
        quant_k = clear([
            
            (p(1) 
                if p(0) not in custom_quants
                and p(1,'sp') in {'subs', 'adjv'}
                and p(1) in custom_quants
                else None),

            (p(2) 
                if p(0) not in custom_quants
                and p(1,'sp')=='art'
                and p(2,'sp') in {'subs', 'adjv'}
                and p(2) in custom_quants
                else None),

            (p(-1) 
                if p(0) not in custom_quants
                and p(-1,'sp') in {'subs', 'adjv'}
                and p(-1) in custom_quants
                else None),

            (p(-2) 
                if p(0) not in custom_quants
                and p(-1,'sp')=='art'
                and p(-2,'sp') in {'subs', 'adjv'}
                and p(-2) in custom_quants
                else None),
            
        ])
        
        self.mom['quant'] = Getter(quant_m)[0]
        self.kid['quant'] = Getter(quant_k)[0]

In [193]:
t = Context(17596)

In [196]:
t.kid['adjv']

In [168]:
A.show(A.search('''

phrase_atom
    word pdp=subs st=a
    <: word lex=H
    <: word sp=subs

'''), withNodes=True, extraFeatures='pdp st', condenseType='phrase_atom', end=10)

  1.49s 194 results


This machinery will allow us to write large yet concise conditional statements that test all kinds of parameters around the context.

Let's make a set of all NPs in the corpus from which we can gradually work from. We will work with phrase_atoms for now.

In [19]:
nps = set(F.typ.s('NP')) & set(F.otype.s('phrase_atom'))
covered = set()

def prog():
    # print remaining cases
    print(len(nps)-len(covered))

print(len(nps))

47504


Let's eliminate all options that have no other choices but a candidate.

In [20]:
simpleres = []

for p in nps:
    cands = [w for w in L.d(p, 'word') if F.sp.v(w) in cand_sps]
    if len(cands) == 1:
        simpleres.append((p, cands[0]))
        
len(simpleres)

27766

Let's inspect the cases to be sure. You can run the next cell to shuffle the results. This helps to identify problems that may be widespread.

In [21]:
random.shuffle(simpleres)
A.table(simpleres, end=20, withNodes=True)

n,p,phrase_atom,word
1,1_Chronicles 26:10,בְכֹ֔ור 1159966,בְכֹ֔ור 404431
2,Proverbs 7:15,פָּ֝נֶ֗יךָ 1126566,פָּ֝נֶ֗יךָ 348726
3,Ezekiel 37:12,עַמִּ֑י 1081555,עַמִּ֑י 283450
4,Deuteronomy 11:17,הָ֣אֲדָמָ֔ה 964651,אֲדָמָ֔ה 99951
5,1_Samuel 20:35,הַשָּׂדֶ֖ה 998057,שָּׂדֶ֖ה 154184
6,2_Samuel 12:20,לֶ֖חֶם 1006398,לֶ֖חֶם 166868
7,Leviticus 21:21,הַכֹּהֵ֔ן 944667,כֹּהֵ֔ן 65291
8,Job 1:14,הָאֲתֹנֹ֖ות 1117088,אֲתֹנֹ֖ות 336329
9,Psalms 34:11,כְּ֭פִירִים 1102369,כְּ֭פִירִים 315677
10,Jeremiah 25:34,יְמֵיכֶ֖ם 1058935,יְמֵיכֶ֖ם 247783


These cases are great. We add them and continue on with the quest.

In [22]:
covered |= set(simpleres)

In [23]:
prog() # progress

19738


**From now on, things get complicated.** We need a function that tests a candidate's syntactic autonomy. That is done by looking at the positions before and after the candidate within the phrase_atom and by applying classical rules of Hebrew noun syntax.

### Valid Pair Disambiguation

In many cases a valid coordinate pair is invalidated due to the immediately preceding term. An example might be found in Hosea 1:2:

> Hosea 1:2 אֵ֤שֶׁת זְנוּנִים֙ וְיַלְדֵ֣י זְנוּנִ֔ים

Here both אשׁה and ילד should be selected as semantic heads. But the algorithm would see ינונים וילדי and assume that ילד is connected with ינונים rather than אשׁה. As a result, the algorithm will evaluate ילד as coordinate with a non-head word and thus eliminate it as a candidate. 

How might we disambiguate such cases?

I propose that the item `B` in an `A & B` coordination are more semantically similar than other words. One way to detect this is to find examples of `A & B` pairs across the corpus where there are no intervening words.

***Shallow Method***
> if `A & B`<br>
> then `A of C & B == A & B`

I do not yet know whether this method alone is powerful enough to disambiguate all cases. As a last resort, we may eventually require vector semantic data. Then we could say:

***Vector Semantic Method***
> if `A ~ B`<br>
> then `A of C & B == A & B`

#### The Shallow Method

Herein we make mappings of acceptable lexeme pairings. Everywhere in the HB that `A & B` is found, we make a mapping of `A` to `B` and `B` to `A`. We can maximize the method's effectiveness by also recording pairwise relations across all pairs in a conjunction chain, so that:

> given `A & B & C` <br>
> then `A ~ C` <br>
> and `B ~ C` <br>
> etc.

In [24]:
subs = {'subs', 'nmpr', 'adjv'}

def conj_climber(a):
    '''
    Climbs down conjunction chains recursively
    and yields the words. Start with first word.
    '''
    yield a
    
    pos = Positions(a, 'phrase')
    
    b = (
        (pos[2] if F.sp.v(pos[1])=='conj' and F.sp.v(pos[2])!='art' and F.sp.v(pos[2]) in subs else None)
    
        or (pos[3] if F.sp.v(pos[-1])=='art' and F.sp.v(pos[1])=='conj'
                and F.sp.v(pos[2])=='art' and F.sp.v(pos[2]) in subs else None)
    
        or (pos[3] if F.sp.v(pos[-1])=='prep' and F.sp.v(pos[1])=='conj'
                and F.sp.v(pos[2])=='prep' and F.sp.v(pos[3]) in subs else None)
    
        or (pos[3] if F.sp.v(pos[-1])=='art' and F.sp.v(pos[-2])=='prep' 
               and F.sp.v(pos[1])=='conj' and F.sp.v(pos[2])=='art' and F.sp.v(pos[3])=='prep'
               and F.sp.v(pos[4]) in subs else None)
        )
    
    if b:
        yield from conj_climber(b)
        

covered = set() # skip items already matched
valid_pairs = collections.defaultdict(set)

for w in F.otype.s('word'):
    
    # skip words already visited in a chain
    if w in covered:
        continue
        
    # check for chain
    chain = list(conj_climber(w))
    if not chain:
        continue
        
    # add pairs
    for i in chain:
        for j in chain:
            
#             if F.lex.v(i)=='>RY/' and F.lex.v(j) == '<WP/':
#                 raise Exception(w, chain)
            
            if i == j:
                continue
            valid_pairs[F.lex.v(i)].add(F.lex.v(j))
            
print(len(valid_pairs), 'valid pairs added...')

3025 valid pairs added...


In [25]:
# expand the set

expanded_pairs = collections.defaultdict(set)

for lex, pairs in valid_pairs.items():
    for paired in pairs:
        expanded_pairs[lex] |= valid_pairs[paired]

print(len(valid_pairs), 'valid pairs added...')

3025 valid pairs added...


In [245]:
#expanded_pairs['>RY/']

In [249]:
'DGH/' in expanded_pairs['<WP/']

False

In [261]:
valid_pairs['QWL/']

{'<NN/',
 '>JC/',
 '>RJH/',
 'BKJ/',
 'BRD/',
 'BRQ/',
 'FFWN/',
 'KNWR/',
 'MVR/',
 'R<C/',
 'R<M/',
 'TRW<H/',
 'TWDH/',
 'XTN/',
 'XYYRH/'}

In [251]:
valid_pairs['CMJM/']

{'>LHJM/', '>RY/', 'BJN/', 'CMJM/', 'DG/', 'KL/', 'KSJL=/', 'RMF/', 'XJH/'}

In [240]:
valid_pairs['CMJM/']

{'>LHJM/', '>RY/', 'BJN/', 'CMJM/', 'DG/', 'KL/', 'KSJL=/', 'RMF/', 'XJH/'}

In [225]:
T.nodeFromSection(('Genesis', 1, 26))

1414379

In [248]:
F.lex.v(507)

'DGH/'

The example below shows that I will need a more robust disambiguation method. I am now considering options.

In [229]:
A.pretty(651834, withNodes=True)

### The Function

In [123]:
ptcp = {'ptca', 'ptcp'} # participles

# gather all candidates
cands = [w for w in F.otype.s('word')
            if F.sp.v(w) in cand_sps-{'verb'}] 
cands.extend(w for w in F.otype.s('word')
                if F.sp.v(w) == 'verb' and F.vt.v(w) in ptcp)

def val_head(cand, talk=False):
    '''
    Checks for the syntactic autonomy of a 
    provided NP head candidate (wordnode).
    Arguments are loaded into a list with 
    the `check` function, which returns a 
    requested item if the arguments are valid.
    '''
    
    if talk:
        print(f'\tprocessing {T.text(cand or 0)}', cand)
    
    # for recursive calls within phrase
    if cand == None:
        return False
    
    # get word positions around candidate
    pos = Positions(cand, 'phrase_atom')
    sp = F.sp.v(cand)
    good = cand in cands
    
    # rules for classifying particular parts of speech
    pos_rules = {
        
        'verb': F.vt.v(cand) in ptcp,
        'prde': not any([F.sp.v(pos[-1]) in {'art', 'subs', 'adjv'},
                         F.sp.v(pos[1] in {'subs', 'adjv'})])
                }
    
    # construct pattern
    const = clear([pos[-1] if F.st.v(pos[-1])=='c' else None,
                   pos[-2] if F.st.v(pos[-2])=='c' and F.sp.v(pos[-1])=='art' else None
                  ])
    

    # adjective patterns
    adjv = clear([
                  pos[-2] if (F.sp.v(pos[-1])=='art' and pos[-2] in cands) else None,
                  pos[-1] if val_head(pos[-1]) else None
        
                 ])
    
    # prepositional modifier pattern
    prep = clear([
                  pos[-2] if (F.sp.v(pos[-1])=='prep' and val_head(pos[-2])) else None,
                  pos[-3] if (F.sp.v(pos[-1])=='art' and F.sp.v(pos[-2])=='prep'\
                                  and val_head(pos[-3])) else None
                ])
    
    # coordinate check
    coord = clear([pos[-2] if F.sp.v(pos[-1])=='conj' and not val_head(pos[-2]) else None,
                   pos[-3] if (F.sp.v(pos[-1])=='art' and F.sp.v(pos[-2])=='conj'\
                                   and not val_head(pos[-3])) else None,
                   pos[-1] if not val_head(pos[-1]) and pos[-1] in cands else None # adjacent coordination: e.g. dog, cat, and man
                  ])
    
    # quant check
    quant = any([cand in quantifiers and pos[1] not in quantifiers,
                 cand in quantifiers and pos[-1] not in quantifiers,
                 cand in quantifiers and F.sp.v(pos[1])=='art' and pos[2] not in quantifiers,
                ])
    
    # independence check
    # quantifiers and modifiers are handled here
    inde = []
    inde.extend(c for c in const if not (c in quantifiers and cand not in quantifiers))
    inde.extend(a for a in adjv if not (a in quantifiers and cand not in quantifiers))
    inde.extend(c for c in coord if not (c in quantifiers and cand not in quantifiers))
    inde_check = not inde and not prep and not quant
    
    # check all rules
    if talk:
        print('\t\t', inde)
        print({'adjv': adjv,})
        print('\t\t', good, pos_rules.get(sp, True),
                inde_check)
    
    return all([
                good, 
                pos_rules.get(sp, True),
                inde_check
               ])

In [84]:
# get complicated test phrases

testers = []

for p in nps:
    
    cands = [w for w in L.d(p, 'word') if F.sp.v(w) in cand_sps]
    
    if len(cands) == 4:
        testers.append((p,))
        
len(testers)

892

In [126]:
random.shuffle(testers)

In [127]:
test = testers[0][0]


for p in testers[:5]:
    
    heads = []
    
    test = p[0]
    
    for w in L.d(test, 'word'):

        if val_head(w):
            heads.append(w)
    
    print()
    A.plain(test)
    print()
    print(f'\t\t\t\tHEADs: {[T.text(h) for h in heads]}')
    print()
    print('-'*15)





				HEADs: ['אֵ֤שֶׁת ']

---------------




				HEADs: ['אֱ֠לֹהֵי ']

---------------




				HEADs: ['שַׂ֕ר ', 'פֶּ֥לֶךְ ']

---------------




				HEADs: ['עֹלֹ֥ות ']

---------------




				HEADs: ['יֹ֣ום ']

---------------


In [274]:
# A.table(A.search('''

# phrase
#     word pdp=subs st=c ls#card lex#BN/ sp#verb
#     <: word pdp=subs st=a
#     <: word sp=conj
#     <: word pdp=subs st=a
    
# '''), withNodes=True)