# Going Subphraseless

The current method for isolating phrase heads ([here](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb)) requires strenuous processing of BHSA subphrase relations. The subphrases are not always consistently encoded and suffer from numerous exceptional cases. As a result, the method is convoluted and inelegant.

This notebook will explore the possibility of disconnecting semantic head analysis from the ETCBC subphrase encoding. 

A "semantic" head is the primary content word of a phrase, following Croft's "Primary Information Bearing Unit":

> **The noun and the verb are the PRIMARY INFORMATION-BEARING UNITS (PIBUs) of the phrase and clause respectively. In common parlance, they are the content words. PIBUs have major informational content that functional elements such as articles and [auxiliaries] do not have. (Croft, *Radical Construction Grammar*, 2001, 258; see also Shead, *Radical Frame Semantics and Biblical Hebrew*, 104)**

> **A (semantic) head is the profile equivalent that is the primary information-bearing unit, that is, the most contentful item that most closely profiles the same kind of thing that the whole constituent profiles. (ibid., 259)**

Croft also provides an additional criterion to "profile equivalence":

> **If the criterion of profile equivalence produces two candidates for headhood, the less schematic meaning is the PIBU; that is, the PIBU is the one with the narrower extension, in the formal semantic sense of that term (ibid., 259)**

## Inquiry

Can we isolate semantic phrase heads in BHSA using only the phrase_atom and phrase limits? This question means that we take the phrase_atom/phrase boundaries for granted. The validity of BHSA phrase boundaries still needs to be tested empirically. But for now, the exercise of isolating semantic phrase heads can be seen as a first step towards reproducible phrase boundaries.

## Basic Concepts

A semantic head will most often stand in a syntactically independent position. For Hebrew nominal phrases, that essentially means a word which is not preceded by a construct and which is semantically central, i.e. not in an attributive slot (e.g. H + noun + H + ATTRIBUTIVE) or an adjectival slot (e.g. noun + noun as in אישׁ טוב).

Quantifier expressions present unique cases: they may be syntactically independent but semantically secondary. They are expressed through specialized lexical items such as cardinal numbers and qualitative quantifiers (e.g. "כל" and "חצי").

Another complication is the use of nouns as prepositional items. Such uses can be seen with words like פני "face" in לפני "in front," and even words like ראשׁ as in ראשׁ החדשׁ "beginning of the month."

Other expressions of quantity, quality, and function provide similar complexities. These cases have to be specified in advance.
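
The criteria above can be tried out on hand-tagged toy phrases before touching the corpus. The following is only a minimal sketch: the tuples, the glosses, and the names `QUANT_GLOSSES`, `NOMINAL_SPS`, and `head_candidates` are all hypothetical and are not part of BHSA or of this repository.

```python
# Toy phrases: each word is (gloss, part of speech, state).
# Tags are hand-written for illustration and are not read from BHSA.
QUANT_GLOSSES = {"all", "half"}  # hypothetical stand-in for the quantifier set
NOMINAL_SPS = {"subs", "nmpr"}   # deliberately narrower than the real nominal set


def head_candidates(phrase):
    """Return the glosses of words in syntactically independent position."""
    candidates = []
    for i, (gloss, sp, state) in enumerate(phrase):
        if sp not in NOMINAL_SPS or gloss in QUANT_GLOSSES:
            continue  # skip function words, attributes, and quantifiers
        # look back to the previous word, stepping over an article
        j = i - 1
        if j >= 0 and phrase[j][1] == "art":
            j -= 1
        # a preceding construct governs this word, unless it is a quantifier
        governed = (
            j >= 0
            and phrase[j][2] == "c"
            and phrase[j][0] not in QUANT_GLOSSES
        )
        if not governed:
            candidates.append(gloss)
    return candidates


all_sons_of_israel = [
    ("all", "subs", "c"),
    ("sons", "subs", "c"),
    ("Israel", "nmpr", "a"),
]
good_man = [("man", "subs", "a"), ("good", "adjv", "a")]

print(head_candidates(all_sons_of_israel))
print(head_candidates(good_man))
```

In this toy setup the quantifier is skipped even though it stands in construct, so the word it quantifies remains a candidate, while the last word is excluded because a non-quantifier construct precedes it.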

### Ambiguity

Considerable ambiguity is present in several cases:

**`A B and C`**<br>
Given A, B, C == nominal words, is their relationship `A // B // C` or `A+B // C`? In other words: **what is the relationship of two adjacent nominal words given a list?** Is B a descriptor of A or is it an independent element?

**`A of B and C`**<br>
Is it `(A of B) // (C)` or `(A of (B // C))`?

Or even:

**`A of B C and D`**<br>
This pattern combines elements from both ambiguous cases.

To address these ambiguities we will apply a battery of disambiguation attempts. Some of those attempts will draw from corpus data, i.e. do we ever see `B and C` with an explicit conjunction elsewhere in the corpus? Or do we ever see `A of C` explicitly in the corpus? Accents may also play a role: do we see a conjunctive or disjunctive accent between `B C`?
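
As a rough illustration of the corpus-based checks, the sketch below looks up attested pairings to suggest a bracketing. The table contents and their shape (lexeme mapped to a set of partner lexemes) are assumptions made for the example, as are the function names; the real pair tables may be structured differently.

```python
# Hypothetical attestation tables; the real pair tables may be
# structured differently. Here: lexeme -> set of partner lexemes.
conj_pairs = {"silver": {"gold"}}            # "silver and gold" is attested
cons_pairs = {"vessel": {"silver", "gold"}}  # "vessel of silver", "vessel of gold"


def seen_conjoined(a, b):
    """True if `a and b` is attested in either order."""
    return b in conj_pairs.get(a, set()) or a in conj_pairs.get(b, set())


def seen_construct(a, b):
    """True if `a of b` is attested; order matters for construct pairs."""
    return b in cons_pairs.get(a, set())


def bracket_a_of_b_and_c(a, b, c):
    """Suggest a bracketing for `A of B and C`, based on whether `A of C` is attested."""
    if seen_construct(a, c):
        return "A of (B // C)"
    return "(A of B) // C"


def bracket_a_b_and_c(a, b, c):
    """Suggest a bracketing for `A B and C`, based on whether `A and B` is attested."""
    if seen_conjoined(a, b):
        return "A // B // C"
    return "A+B // C"


print(bracket_a_of_b_and_c("vessel", "silver", "gold"))
print(bracket_a_b_and_c("man", "good", "wife"))
```

A missing attestation is weak evidence at best, so a lookup like this would be one test among several rather than a final decision.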

## Prerequisites

A number of pre-defined word sets are needed for processing quantification and ambiguous adjacency. These sets are made available in the form of `wsets`, a dictionary containing word sets that are calculated in the `wordsets` directory of this repository. The following wordsets have been defined:

* nominals (`noms`) – a set of word nodes with parts of speech and participles that have the potential to function as nominalized elements. The selected parts of speech are quite permissive: `{'subs', 'nmpr', 'adjv', 'advb', 'prde', 'prps', 'prin', 'inrg'}`. Since parts of speech are not taken as universal linguistic categories but only summaries of language-specific word tendencies (cf. Croft, *Radical Construction Grammar*, 2001), we consider that almost any part of speech can be used in a nominal pattern (or construction). There are some upper limits to this assumption, though. For instance, we exclude conjunctions, articles, prepositions, and negators.
* prepositions (`preps`) – a word set consisting of words with a part of speech category of `prep`, a lexical set (`ls`) feature of `ppre` ("potential preposition"), as well as a select group of nouns like פני "face" which have been processed for prepositionality.
* quantifiers (`quants`) – a set of word nodes that are cardinal numbers or qualitative quantifiers such as כל.
* mwords – mapping from a word to its phonological word group ("masoretic word"); joins words on maqqeph and ø space
* accent_type – a mapping from a word to its accent type: conjunctive or disjunctive
* conj_pairs – a dict of observed conjunction pairings of lexemes in the corpus: `A & B`
* cons_pairs – a dict of observed construct pairings of lexemes in the corpus: `A of B`
* mom – mapping from word node to its mother word node for a specified relationship: `mom[A]['coord'] = B`
* kid – opposite of mom; mapping from word to its children nodes for a relationship: `kid[A]['cons'] = B`
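
To make the relation between the last two mappings concrete, here is a small sketch that inverts a mother mapping into a children mapping. The node numbers are invented, the helper `invert` is hypothetical, and storing the children as a list is an assumption; the actual `kid` set may store them differently.

```python
from collections import defaultdict

# Hypothetical word nodes: word 2 is in construct relation to word 1,
# and word 4 is coordinated with word 2.
mom = {
    2: {"cons": 1},
    4: {"coord": 2},
}


def invert(mom):
    """Build a children mapping (mother -> relation -> list of children)."""
    kid = defaultdict(lambda: defaultdict(list))
    for child, relas in mom.items():
        for rela, mother in relas.items():
            kid[mother][rela].append(child)
    return kid


kid = invert(mom)
print(kid[1]["cons"])
print(kid[2]["coord"])
```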

**Let's get started**. We load the necessary functions and BHSA data (straight from source).

In [2]:
import collections
import pickle
import random
import re
from IPython.display import display, HTML
from datetime import datetime
from pprint import pprint
from tf.app import use
# load the pre-computed word sets, closing the file when done
with open('wordsets/wsets.pickle', 'rb') as infile:
    wsets = pickle.load(infile)
A = use('bhsa', hoist=globals(), silent=True)

   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


### Wordsets

In [3]:
wsets.keys()

dict_keys(['noms', 'preps', 'quants', 'accent_type', 'mwords', 'conj_pairs', 'cons_pairs', 'mom', 'kid'])

In [4]:
list(wsets['cons_pairs'].keys())[:10]

[]

### Machinery

We could use some machinery to do the hard work of looking in and around a node. In the older approach we used TF search templates. But these are not very efficient at scale, and they are always bound by the limits of the query language. I take another approach here: a class which, when called, can give all sorts of contextual information about any node which is fed in. This class is related to the `Mom` class used in wordsets ([here](wordsets/context.py)), and in fact it depends on the output of that class which is stored in the `mom` and `kid` word sets.

The "machines" we use are defined in the positions module. They are a series of position testers for a node and string evaluators. By writing our conditions as strings, we can maintain a mapping between the truth conditions and their results. This is intended to aid debugging and the process of elimination which we apply herein.
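
The idea of keeping each condition paired with its result can be sketched in a few lines. This is not the `Evaluator` from the positions module, whose internals are not shown here; the function `evaluate` and the names in the namespace are hypothetical.

```python
def evaluate(conds, namespace):
    """Evaluate each condition string and keep it paired with its result."""
    # eval is acceptable here only because the strings are written by us
    return {
        cond: bool(eval(cond, {"__builtins__": {}}, namespace))
        for cond in conds
    }


conds = (
    "sp == 'adjv'",
    "prev_st in {'NA', 'a'}",
    "prev_is_nominal",
)

# hypothetical values describing a word and its left neighbour
namespace = {"sp": "adjv", "prev_st": "c", "prev_is_nominal": True}

results = evaluate(conds, namespace)
failed = [cond for cond, value in results.items() if not value]

for cond, value in results.items():
    print('{:<30} {:>10}'.format(cond, str(value)))
```

Because the keys are the condition strings themselves, a rejected word can be traced back to the exact test that failed (here, the state of the preceding word).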

In [5]:
from wordsets.positions import Getter, Positions, Evaluator, showconds
from wordsets.context import Mom, Relas

### Mom

In [6]:
# relas = Relas(A, **wsets)
# mom = relas.mom
# kid = relas.kid

## Construction

Let's begin constructing the heads evaluator.

In [7]:
# Make a positions mapping for high performance
# position checking 

# word node -> {relative position: word node}
mapping = collections.defaultdict(dict)

for pa in F.otype.s('phrase_atom'):
    phrase_words = L.d(pa, 'word')
    for wi in phrase_words:
        for wj in phrase_words:
            pos = wj-wi
            mapping[wi][pos] = wj
            
print('done!')

done!
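
The same relative-position mapping can be inspected on a toy example without loading Text-Fabric. The phrase names and node numbers below are invented, and the sketch assumes that the word nodes inside one phrase atom are consecutive integers.

```python
import collections

# hypothetical phrase atoms with their word nodes
toy_phrases = {"pa1": [10, 11, 12], "pa2": [13, 14]}

toy_mapping = collections.defaultdict(dict)

for words in toy_phrases.values():
    for wi in words:
        for wj in words:
            # the offset of wj as seen from wi; 0 is the word itself
            toy_mapping[wi][wj - wi] = wj

print(toy_mapping[11])          # neighbours of word 11 by offset
print(toy_mapping[12].get(1))   # nothing follows the last word of pa1
```

Lookups never cross a phrase boundary, since only words of the same atom are paired. The price is one entry per ordered word pair, so the table grows quadratically with phrase length.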


In [33]:
phrases = [ph for ph in F.otype.s('phrase_atom') 
               if F.typ.v(ph) in {'NP', 'PP'}
               and len(L.d(ph, 'word')) > 1
          ]

words = [w for ph in phrases for w in L.d(ph, 'word')]

print(len(phrases) * len(words))

16567350018


In [34]:
len(words)

220551

In [35]:
len(phrases)

75118

In [54]:
class QwikPositions:
    
    def __init__(self, node, contextdict, tf):
        self.Fs = tf.api.Fs
        self.node = node
        self.contextdict = contextdict
        
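    def get(self, position, *feats):
        # node at the requested position relative to self.node (None if absent)
        target = self.contextdict[self.node].get(position, None)
        if not feats:
            return target
        if target is None:
            return set()
        # look up the features on the positioned node, not on self.node
        return set(self.Fs(f).v(target) for f in feats)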

def prettyconds(condsdict):
    '''
    Iterate through an explain dict for a rela
    and print out all of checked conditions.
    '''
    for source, searches in condsdict.items():
        for search in searches:
            src, target, conds = search
            print(f'{source} -> {target}')
            for cond, value in conds.items():
                print('\t{:<30} {:>30}'.format(cond, str(value)))
            print()
        
def search_phrases(relastr, show=10, end=None, typs={'NP', 'PP'}):
    '''
    Searches phrases with the specified relation 
    and prints out their descriptive explanation.
    '''
    
    start = datetime.now()
    print("beginning search")
    
    matches = []
    phrases = [ph for ph in F.otype.s('phrase_atom') 
                   if F.typ.v(ph) in typs
                   and len(L.d(ph, 'word')) > 1
              ]
    random.shuffle(phrases)
    
    for i,ph in enumerate(phrases):
        
        # a phrase can add several matches at once, so test with >=
        if end and len(matches) >= end:
            break
            
        if not i%5000:
            print(f'{len(matches)} found ({i}/{len(phrases)})')
        
        for w in L.d(ph, 'word'):
            R = Relations(w, Grammar, A).analyze()
            if R and R[0] == relastr:
                matches.append(R)
        
    # display
    print('done at', datetime.now() - start)
    print(len(matches), 'matches found...')
    print('showing', min(show, len(matches)))
    for match in matches[:show]:
        
        rela, src, tgt, cond = match
        phrase = L.u(src, 'phrase_atom')[0]
        
        highlights = {src:'pink',
                      tgt:'lightgreen'}
        
        A.pretty(phrase, withNodes=True, extraFeatures='sp st', highlights=highlights)
        print(cond)
        display(HTML('<hr>'))

        
class Grammar:
    
    def __init__(self, wsets):
        noms = wsets['noms']
        quants = wsets['quants']
        preps = wsets['preps']
        self.rules = {} # put searches here
        
        self.rules['cons'] = (
            
            {
                'src': lambda P: P(0),
                'tgt': lambda P: P(-1),
                'cnd': {
                    
                    'P(-1, st) == c': 
                        lambda P: P(-1,'st') == 'c',
                    
                    'P(-1) not in quants|preps': 
                        lambda P: P(-1) not in quants|preps,
                }
            },
            
            {
                'src': lambda P: P(0),
                'tgt': lambda P: P(-2),
                'cnd': {
                    
                    'P(-1,sp) == art': 
                        lambda P: P(-1,'sp') == 'art',
                    
                    'P(-2, st) == c': 
                        lambda P: P(-2,'st') == 'c',
                    
                    'P(-2) not in quants|preps': 
                        lambda P: P(-2) not in quants|preps,
                }
                
            }
        )

        self.rules['adjv'] = (
                    
            {
                'src': lambda P: P(0),
                'tgt': lambda P: P(-1),
                'cnd': {
                    
                    'P(0, sp) == adjv':
                        lambda P: P(0,'sp') == 'adjv',
                    
                    'P(-1, st) & {NA, a}': 
                        lambda P: P(-1,'st') in {'NA', 'a'},
                    
                    'P(-1) in noms':
                        lambda P: P(-1) in noms,
                    
                    'P(-1) not in {quants|preps}':
                        lambda P: P(-1) not in quants|preps,
                    
#                     'self.search(P(-1), self.searches[adjv])':
#                         lambda P: self.search(P(-1), self.searches['adjv'])
                }
            },
            
            {
                'src': lambda P: P(0),
                'tgt': lambda P: P(1),
                'cnd': {
                    
                    'P(0, sp) == advb':
                        lambda P: P(0,'sp') == 'advb',
                    
                    'P(1) in noms':
                        lambda P: P(1) in noms,
                }
            },
        
        )
        
    def get(self):
        return self.rules
        
class Relations:
    
    '''
    Tags relations around a word.
    '''
    
    def __init__(self, w, grammar, tf):
        
        # TF and sets
        self.tf = tf
        self.F, self.L = tf.api.F, tf.api.L
        self.w = w
        
        # load the rules; the analysis itself runs when analyze() is called
        self.rules = grammar(wsets).get()
        
    def analyze(self):
        '''
        Identifies relations on a word in a phrase
        that match a supplied set of conditions.
        Requires a relation string and a series 
        of search dicts with keys: {src, tgt, cnd}
        
        -- src -- 
        the origin of the relation, position zero, i.e. P(0)
        
        -- tgt-- 
        the target of the relation, position +/- from origin
        
        --cnd-- 
        a dict that maps a descriptive condition string to a 
        test function, which is called on the positions getter P.
        Keeping the string alongside the test preserves a mapping 
        between each condition and its truth value.
        
        Returns (rela, source, target, conds) for the first rule
        whose conditions all hold; otherwise returns None.
        '''
        
        # set up positions and namespaces
        P = Positions(self.w, 'phrase_atom', self.tf).get
        
        # find first match
        for rela, rulesets in self.rules.items():
            for rule in rulesets: 
                source = rule['src'](P)
                target = rule['tgt'](P)
                conds = {cond:test(P) for cond, test in rule['cnd'].items()}
                if all(conds.values()):
                    return (rela, source, target, conds)
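
The first-match logic can be exercised without Text-Fabric by swapping the positions getter for a stand-in over a toy phrase. Everything in this sketch is hypothetical: the phrase, the helper `make_P`, and the reduced `cons` rules, which omit the quantifier and preposition checks used above.

```python
# toy phrase "house (construct) + the + king", tagged by hand
phrase = [
    {"gloss": "house", "sp": "subs", "st": "c"},
    {"gloss": "the", "sp": "art", "st": "NA"},
    {"gloss": "king", "sp": "subs", "st": "a"},
]


def make_P(phrase, index):
    """Return a getter that mimics P(position) and P(position, feature)."""
    def P(offset, feat=None):
        i = index + offset
        if not 0 <= i < len(phrase):
            return None  # outside the phrase
        return i if feat is None else phrase[i][feat]
    return P


rules = {
    "cons": (
        {"src": lambda P: P(0), "tgt": lambda P: P(-1),
         "cnd": {"P(-1, st) == c": lambda P: P(-1, "st") == "c"}},
        {"src": lambda P: P(0), "tgt": lambda P: P(-2),
         "cnd": {"P(-1, sp) == art": lambda P: P(-1, "sp") == "art",
                 "P(-2, st) == c": lambda P: P(-2, "st") == "c"}},
    ),
}


def first_match(P):
    """Return (rela, source, target, conds) for the first rule that holds."""
    for rela, rulesets in rules.items():
        for rule in rulesets:
            conds = {label: test(P) for label, test in rule["cnd"].items()}
            if all(conds.values()):
                return rela, rule["src"](P), rule["tgt"](P), conds
    return None


print(first_match(make_P(phrase, 2)))  # "king" is governed by "house"
print(first_match(make_P(phrase, 0)))  # "house" has no governing word
```

The first rule fails for "king" because the article intervenes, and the second rule then succeeds by looking one position further back.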

In [55]:
search_phrases('adjv', show=100, end=50)

beginning search
0 found (0/75118)
done at 0:01:12.315068
50 matches found...
showing 50


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == advb': True, 'P(1) in noms': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == advb': True, 'P(1) in noms': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == advb': True, 'P(1) in noms': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == advb': True, 'P(1) in noms': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == advb': True, 'P(1) in noms': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == advb': True, 'P(1) in noms': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == advb': True, 'P(1) in noms': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == advb': True, 'P(1) in noms': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


{'P(0, sp) == adjv': True, 'P(-1, st) & {NA, a}': True, 'P(-1) in noms': True, 'P(-1) not in {quants|preps}': True}


<hr>

In [95]:
# A.table(A.search('''

# word st=a

# '''), end=5)

### Testing

This machinery will allow us to write large yet concise conditional statements that test all kinds of parameters around the context.

Let's make a set of all NPs in the corpus from which we can gradually work. We will work with phrase_atoms for now.

In [19]:
nps = set(F.typ.s('NP')) & set(F.otype.s('phrase_atom'))
covered = set()

def prog():
    # print remaining cases
    print(len(nps)-len(covered))

print(len(nps))

47504


Let's first eliminate the phrase atoms that contain only a single candidate word, since they leave no choice to make.

In [20]:
# cand_sps, the set of candidate parts of speech, is not defined in the cells shown here
simpleres = []

for p in nps:
    cands = [w for w in L.d(p, 'word') if F.sp.v(w) in cand_sps]
    if len(cands) == 1:
        simpleres.append((p, cands[0]))
        
len(simpleres)

27766

In [9]:
def get_quantified(word, tf, **wsets):
    '''
    Recursively calls down a quantifier chain until
    finding a quantified word. Returns None if the 
    phrase ends before a quantified word is found.
    '''
    
    quants, noms = wsets['quants'], wsets['noms']

    P = Positions(word, 'phrase_atom', tf).get

    # tests on the word at relative position n
    target = (
        lambda n: P(n), 
        lambda n: P(n) not in quants,
        lambda n: P(n) in noms, # noms is a set of word nodes
    )

    # check this word
    if all(cond(0) for cond in target):
        return word
    
    # check next word
    elif all(cond(1) for cond in target):
        return P(1)
    
    # move up one, as long as a next word exists
    # (assumes P returns a falsy value past the phrase boundary)
    elif P(1):
        return get_quantified(P(1), tf, **wsets)
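
The walk down a quantifier chain can also be written as a simple loop, which is easier to test on toy data. This is a sketch under stated assumptions: words are plain strings rather than nodes, and `quantified_word` and the example sets are hypothetical.

```python
def quantified_word(words, start, quants, noms):
    """Walk right from `start` to the first nominal that is not a quantifier."""
    for i in range(start, len(words)):
        w = words[i]
        if w in noms and w not in quants:
            return w
    return None  # the phrase ended before a quantified word was found


# "three hundred men": two cardinals quantify one noun
words = ["three", "hundred", "men"]
quants = {"three", "hundred", "all"}
noms = {"three", "hundred", "all", "men", "people"}

print(quantified_word(words, 0, quants, noms))
print(quantified_word(["all", "the", "people"], 0, quants, noms))
print(quantified_word(["three", "hundred"], 0, quants, noms))
```

The loop stops at the phrase boundary by construction, which avoids the need for an explicit termination test in the recursive version.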