# Going Subphraseless

The current method for isolating phrase heads ([here](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb)) requires strenuous and inelegant processing of BHSA subphrase relations. The subphrases are not always consistently encoded and suffer from numerous exceptional cases. As a result, the method is rather convoluted.

This notebook will explore the possibility of disconnecting semantic head analysis from the ETCBC subphrase encoding. 

A "semantic" head is the primary content word of a phrase, following Croft's "Primary Information Bearing Unit":

> **The noun and the verb are the PRIMARY INFORMATION-BEARING UNITS (PIBUs) of the phrase and clause respectively. In common parlance, they are the content words. PIBUs have major informational content that functional elements such as articles and [auxiliaries] do not have. (Croft, *Radical Construction Grammar*, 2001, 258; see also Shead, *Radical Frame Semantics and Biblical Hebrew*, 104)**

> **A (semantic) head is the profile equivalent that is the primary information-bearing unit, that is, the most contentful item that most closely profiles the same kind of thing that the whole constituent profiles. (ibid., 259)**

Croft also provides an additional criterion to "profile equivalence":

> **If the criterion of profile equivalence produces two candidates for headhood, the less schematic meaning is the PIBU; that is, the PIBU is the one with the narrower extension, in the formal semantic sense of that term (ibid., 259)**

## Inquiry

Can we isolate semantic phrase heads in BHSA using only the phrase_atom and phrase limits? This question means that we take the phrase_atom/phrase boundaries for granted. The validity of the BHSA phrase boundaries still needs to be tested empirically. But for now, the exercise of isolating semantic phrase heads could be seen as the first step towards reproducible phrase boundaries.

## Basic Concepts

A semantic head will most often stand in a syntactically independent position. For Hebrew nominal phrases, that essentially means a word which is not preceded by a construct and which is semantically central, i.e. not in an attributive slot (e.g. H + noun + H + ATTRIBUTIVE) or an adjectival slot (e.g. noun + noun as in אישׁ טוב).

Quantifier expressions present a special case: they may be syntactically independent but semantically secondary. They are expressed through specialized lexical items such as cardinal numbers and qualitative quantifiers (e.g. "כל" and "חצי").

Another complication is the use of nouns as prepositional items. Such uses can be seen with words like פני "face," as in לפני "in front," and even with words like ראשׁ, as in ראשׁ החדשׁ "beginning of the month."

Other expressions of quantity, quality, and function provide similar complexities. These cases have to be specified in advance.
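The selection principle described above can be sketched on toy data. This is only an illustration under stated assumptions: the `Word` class, the `QUANTIFIERS` set, and the `head_candidates` function are hypothetical names, the words are hand-written transliterations rather than BHSA nodes, and only the construct and quantifier cases are covered.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str  # toy transliteration, not real BHSA data
    sp: str    # part of speech (BHSA-style values such as subs, art, adjv)
    st: str    # state: 'a' absolute, 'c' construct, 'NA' not applicable

# hypothetical stand-in for a quantifier word set
QUANTIFIERS = {'KL'}

def head_candidates(words):
    """Return nouns that are neither governed by a construct nor quantifiers."""
    candidates = []
    governed = False  # True when the preceding content word is in construct
    for word in words:
        if word.sp == 'art':
            continue  # an article does not interrupt a construct chain
        if word.text in QUANTIFIERS:
            continue  # syntactically independent, but semantically secondary
        if word.sp == 'subs' and not governed:
            candidates.append(word.text)
        governed = word.st == 'c'
    return candidates

rule_of_night = [Word('MMCLT', 'subs', 'c'), Word('H', 'art', 'NA'), Word('LJLH', 'subs', 'a')]
all_the_land = [Word('KL', 'subs', 'c'), Word('H', 'art', 'NA'), Word('>RY', 'subs', 'a')]
good_man = [Word('>JC', 'subs', 'a'), Word('VWB', 'adjv', 'a')]

print(head_candidates(rule_of_night))  # ['MMCLT']
print(head_candidates(all_the_land))   # ['>RY']
print(head_candidates(good_man))       # ['>JC']
```

Note how the quantifier is skipped without marking the following noun as governed, so the quantified noun surfaces as the candidate.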

### Ambiguity

Considerable ambiguity is present in several cases:

**`A B and C`**<br>
Given A, B, C == nominal words. Is their relationship `A // B // C` or `A+B // C`? In other words: **what is the relationship of two adjacent nominal words given a list?** Is B a descriptor of A or is it an independent element? 

**`A of B and C`**<br>
Is it `(A of B) // (C)` or `(A of (B // C))`?

Or even:

**`A of B C and D`**<br>
This pattern combines elements from both ambiguous cases.

To address these ambiguities we will apply a battery of disambiguation attempts. Some of those attempts will draw on corpus data: do we ever see `B and C` joined by an explicit conjunction elsewhere in the corpus? Or do we ever see `A of C` explicitly in the corpus? Accents may also play a role: do we see a conjunctive or disjunctive accent between `B C`? 
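A minimal sketch of the corpus-evidence idea, assuming that the observed pairings are stored as a dict from a lexeme to the set of lexemes seen with it. That shape, the English glosses used as lexemes, and the function names are all assumptions made for illustration. When no evidence is found the sketch returns `'undecided'` so that another test, such as the accent check, can take over.

```python
# toy glosses stand in for BHSA lexemes; the dict-of-sets shape is an assumption
cons_pairs = {'king': {'Judah', 'Israel'}, 'house': {'father'}}
conj_pairs = {'silver': {'gold'}, 'gold': {'silver'}}

def seen(pairs, first, second):
    """True if `second` has been observed paired with `first`."""
    return second in pairs.get(first, set())

def read_construct(a, b, c, cons_pairs):
    """Reading of `A of B and C`, judged by whether `A of C` is attested."""
    if seen(cons_pairs, a, c):
        return 'A of (B // C)'
    return 'undecided'

def read_list(a, b, c, conj_pairs):
    """Reading of `A B and C`, judged by whether A and B are attested as conjuncts."""
    if seen(conj_pairs, a, b) or seen(conj_pairs, b, a):
        return 'A // B // C'
    return 'undecided'

print(read_construct('king', 'Judah', 'Israel', cons_pairs))    # A of (B // C)
print(read_construct('house', 'father', 'servant', cons_pairs)) # undecided
print(read_list('silver', 'gold', 'bronze', conj_pairs))        # A // B // C
```

Absence of a pairing is treated as missing evidence rather than as a vote for the other reading, since a small corpus will leave many valid pairings unattested.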

## Prerequisites

A number of pre-defined word sets are needed for processing quantification and ambiguous adjacency. These sets are made available in the form of `wsets`, a dictionary containing word sets that are calculated in the `wordsets` directory of this repository. The following wordsets have been defined:

* nominals – a set of word nodes with parts of speech and participles that have the potential to function as nominalized elements. The selected parts of speech are quite permissive: `{'subs', 'nmpr', 'adjv', 'advb', 'prde', 'prps', 'prin', 'inrg'}`. Since parts of speech are not taken as universal linguistic categories but only summaries of language-specific word tendencies (cf. Croft, *Radical Construction Grammar*, 2001), we consider that almost any part of speech can be used in a nominal pattern (or construction). There are some upper limits to this assumption, though. For instance, we exclude conjunctions, articles, prepositions, and negators. 
* prepositions – a word set consisting of words with a part of speech category of `prep`, a lexical set (`ls`) feature of `ppre` ("potential preposition"), as well as a select group of nouns like פני "face" which have been processed for prepositionality. 
* quantifiers – consists of word nodes that are cardinal numbers or qualitative quantifiers such as כל.
* mwords – mapping from a word to its phonological word group ("masoretic word"); joins words on maqqeph and ø space
* accent_type – a mapping from a word to its accent type: conjunctive or disjunctive
* conj_pairs – a dict of observed conjunction pairings of lexemes in the corpus: `A & B`
* cons_pairs – a dict of observed construct pairings of lexemes in the corpus: `A of B`
* mom – mapping from word node to its mother word node for a specified relationship: `mom[A]['coord'] = B`
* kid – opposite of mom; mapping from word to its children nodes for a relationship: `kid[A]['cons'] = B`
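To make the relationship between `mom` and `kid` concrete, here is a small sketch that inverts a toy `mom` mapping into a `kid` mapping. The node numbers are invented, the function name `invert_mom` is hypothetical, and storing children in a list is an assumption; the actual word sets are built by the code in the `wordsets` directory.

```python
import collections

def invert_mom(mom):
    """Build a kid mapping (mother -> relation -> [children]) from a mom mapping."""
    kid = collections.defaultdict(lambda: collections.defaultdict(list))
    for child, relations in mom.items():
        for rela, mother in relations.items():
            kid[mother][rela].append(child)
    return kid

# toy data: word 2 is the construct dependent of word 1,
# words 4 and 5 are coordinated to word 2
mom = {2: {'cons': 1}, 4: {'coord': 2}, 5: {'coord': 2}}
kid = invert_mom(mom)

print(dict(kid[1]))  # {'cons': [2]}
print(dict(kid[2]))  # {'coord': [4, 5]}
```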

**Let's get started**. We load the necessary functions and BHSA data (straight from source).

In [156]:
import collections
import pickle
import random
import re
from IPython.display import display, HTML
from pprint import pprint
from tf.app import use
with open('wordsets/wsets.pickle', 'rb') as infile:
    wsets = pickle.load(infile)
A = use('bhsa', hoist=globals(), silent=True)

   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


### Wordsets

In [2]:
wsets.keys()

dict_keys(['noms', 'preps', 'quants', 'accent_type', 'mwords', 'conj_pairs', 'cons_pairs', 'mom', 'kid'])

In [3]:
list(wsets['cons_pairs'].keys())[:10]

[]

### Machinery

We could use some machinery to do the hard work of looking in and around a node. In the older approach we used TF search templates. But these are not very efficient at scale, and they are always bound by the limits of the query language. I take another approach here: a class which, when called, can give all sorts of contextual information about any node which is fed in. This class is related to the `Mom` class used in wordsets ([here](wordsets/context.py)), and in fact it depends on the output of that class which is stored in the `mom` and `kid` word sets.

The "machines" we use are defined in the positions module. They are a series of position testers for a node and string evaluators. By writing our conditions as strings, we can maintain a mapping between the truth conditions and their results. This is intended to aid debugging and the process of elimination which we apply herein.
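The idea of keeping conditions as strings can be shown in a few lines. This is a sketch of the principle only, not the `Evaluator` class from the positions module: the function name `evaluate_conditions` and the variables in the namespace are hypothetical. It relies on Python's built-in `eval`, which should only ever be given trusted, hand-written strings.

```python
def evaluate_conditions(namespace, *conditions):
    """Map each condition string to its truth value in the given namespace."""
    return {cond: bool(eval(cond, {}, namespace)) for cond in conditions}

# toy namespace: values that a position tester might supply for one word
namespace = {
    'state': 'c',
    'next_sp': 'art',
    'node': 307,
    'quants': {12, 40},
    'preps': {306},
}

results = evaluate_conditions(
    namespace,
    "state == 'c'",
    "node not in quants|preps",
    "next_sp != 'art'",
)

for cond, value in results.items():
    print(f'{cond:<30} {value}')
print('match:', all(results.values()))
```

Because the dict keys are the condition strings themselves, a failed match immediately shows which condition was responsible.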

In [5]:
from wordsets.positions import Getter, Positions, Evaluator, showconds
from wordsets.context import Mom, Relas

### Mom

In [87]:
# relas = Relas(A, **wsets)
# mom = relas.mom
# kid = relas.kid

## Construction

Let's begin constructing the heads evaluator.

In [158]:
class Relations:
    
    '''
    Tags relations within a phrase or phrase_atom.
    '''
    
    def __init__(self, phrase, wsets, tf):
        
        # TF and sets
        self.tf = tf
        self.phrase = phrase
        self.F, self.T, self.L = tf.api.F, tf.api.T, tf.api.L
        self.wsets = wsets
        self.noms = wsets['noms']
        self.quants = wsets['quants']
        self.preps = wsets['preps']
        
        # evaluators
        self.evaler = Evaluator(locals())
        self.eval = self.evaler.evaluate
        self.conddict = self.evaler.conddict
    
        # put relations here
        self.rela = collections.defaultdict(lambda: collections.defaultdict())
        self.explain = collections.defaultdict(lambda: collections.defaultdict())
        
        # run relation taggings
        self.construct()
    
    def search(self, relastr, *searchdicts):
        '''
        Identifies relations in the phrase
        that match a supplied set of conditions.
        Requires a relation string and a series 
        of search dicts with keys: {src, tgt, cnd}
        
        -- src --
        the origin of the relation, position zero, i.e. P(0)
        
        -- tgt --
        the target of the relation, position +/- from origin
        
        -- cnd --
        a tuple of condition strings which are evaluated
        against the populated namespace.
        Search populates the namespace of the Evaluator class before
        evaluating the strings. Strings are used instead of raw code
        so that a mapping can be preserved between a condition and its 
        truth value. 
        '''
        
        # cycle through phrase words
        for w in self.L.d(self.phrase, 'word'):
            
            # set up positions and namespaces
            noms, quants, preps = self.noms, self.quants, self.preps
            P = Positions(w, self.F.otype.v(self.phrase), self.tf).get
            self.evaler.update(locals())
            
            # evaluate conditions
            matches = []
            evaled = [] # keep track for explain
            for search in searchdicts:  
                source = self.eval(search['src'])
                target = self.eval(search['tgt'])
                conds = self.conddict(*search['cnd'])
                evaled.append((source, target, conds)) # for explain
                if all(conds.values()):
                    matches.append((source, target))

            # assume first valid match is good
            match = Getter(matches)[0]
            
            if match:
                
                # unpack match data
                source, target = match
                
                # assign match data
                self.rela[relastr][source] = target
                self.explain[relastr][source] = evaled
                

    def construct(self):

        self.search('cons',
                    
            {
                'src': 'P(0)',
                'tgt': 'P(1)',
                'cnd': (
                    "P(0,'st') == 'c'",
                    "P(0) not in quants|preps",
                    "P(1,'sp') != 'art'"
                )
            },
            
            {
                'src': 'P(0)',
                'tgt': 'P(2)',
                'cnd': (
                    "P(0,'st') == 'c'",
                    "P(0) not in quants|preps",
                    "P(1, 'sp') == 'art'"
                )
                
            }
        )
        
    def adjective(self):
        
        self.search('adjc',
                    
            {
                'src': 'P(0)',
                'tgt': 'P(-1)',
                'cnd': (
                    "P(0,'st') in {'NA', 'a'}",
                    "P(0) in noms",
                    "P(0,'sp') in {'adjv', 'advb'}",
                )
            }
        
        )
        
def prettyconds(condsdict):
    '''
    Iterate through an explain dict for a rela
    and print out all of checked conditions.
    '''
    for source, searches in condsdict.items():
        for search in searches:
            src, target, conds = search
            print(f'{source} -> {target}')
            for cond, value in conds.items():
                print('\t{:<30} {:>30}'.format(cond, str(value)))
            print()
        
def search_phrases(relastr, end=10):
    '''
    Searches phrases with the specified relation 
    and prints out their descriptive explanation.
    '''
    
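    matches = []
    for ph in F.otype.s('phrase_atom'):
        R = Relations(ph, wsets, A)
        # results live in R.rela; Relations defines no __getitem__
        if R.rela[relastr]:
            matches.append(R)
        
    # display
    for match in matches[:end]:
        A.pretty(match.phrase, withNodes=True)
        # prettyconds expects the explain dict, not the rela dict
        prettyconds(match.explain[relastr])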
        display(HTML('<hr>'))
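The search loop can be tried without Text-Fabric on a three-word toy phrase (construct noun, article, noun). Everything here is a simplified stand-in: `WORDS`, `positions`, `CONS_PATTERNS`, and `tag_relations` are hypothetical names, and the getter only imitates the way `P` is called in the class above. The sketch keeps the two properties that matter: the first pattern whose conditions all hold supplies the relation, and every evaluated pattern is recorded for later explanation.

```python
import collections

# toy phrase: construct noun + article + noun, with invented node numbers
WORDS = {
    307: {'sp': 'subs', 'st': 'c'},
    308: {'sp': 'art', 'st': 'NA'},
    309: {'sp': 'subs', 'st': 'a'},
}
NODES = sorted(WORDS)

def positions(node):
    """Return a getter P(offset, feature=None) relative to node."""
    index = NODES.index(node)
    def P(offset, feature=None):
        i = index + offset
        if not 0 <= i < len(NODES):
            return None  # outside the phrase
        target = NODES[i]
        return WORDS[target][feature] if feature else target
    return P

CONS_PATTERNS = (
    {'tgt': 1, 'cnd': ("P(0,'st') == 'c'", "P(1,'sp') != 'art'")},
    {'tgt': 2, 'cnd': ("P(0,'st') == 'c'", "P(1,'sp') == 'art'")},
)

def tag_relations(patterns):
    """Tag source -> target relations; keep all evaluations for explanation."""
    rela, explain = {}, collections.defaultdict(list)
    for node in NODES:
        P = positions(node)
        for pattern in patterns:
            target = P(pattern['tgt'])
            conds = {c: bool(eval(c, {}, {'P': P})) for c in pattern['cnd']}
            explain[node].append((target, conds))
            # the first pattern that fully matches wins
            if target is not None and all(conds.values()) and node not in rela:
                rela[node] = target
    return rela, dict(explain)

rela, explain = tag_relations(CONS_PATTERNS)
print(rela)  # {307: 309}
for target, conds in explain[307]:
    print(307, '->', target, conds)
```

The first pattern fails for word 307 because an article follows, so the second pattern skips the article and links 307 to 309.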

In [126]:
test = 904941

R = Relations(test, wsets,  A)

In [155]:
prettyconds(R.explain['cons'])

307 -> 308
	P(0,'st') == 'c'                                         True
	P(0) not in quants|preps                                 True
	P(1,'sp') != 'art'                                      False

307 -> 309
	P(0,'st') == 'c'                                         True
	P(0, 'st') not in quants|preps                           True
	P(1, 'sp') in {'art'}                                    True



In [79]:
for rel, ex in R.explain[25].items():
    print(rel)
    pprint(ex)

cons
[(25,
  26,
  {'P(0) not in quants': True,
   "P(0,'st') == 'c'": True,
   "P(1,'sp') != 'art'": True})]


In [125]:
A.table(A.search('''

word st=a

'''), end=5)

  0.66s 245354 results


n,p,word
1,Genesis 1:1,בְּ
2,Genesis 1:1,בָּרָ֣א
3,Genesis 1:1,אֵ֥ת
4,Genesis 1:1,הַ
5,Genesis 1:1,וְ


In [107]:
T.text(t)

'לְמֶמְשֶׁ֣לֶת הַלַּ֔יְלָה '

In [108]:
t

(904941,)

### Testing

This machinery will allow us to write large yet concise conditional statements that test all kinds of parameters around the context.

Let's make a set of all NPs in the corpus from which we can gradually work. We will work with phrase_atoms for now.

In [19]:
nps = set(F.typ.s('NP')) & set(F.otype.s('phrase_atom'))
covered = set()

def prog():
    # print remaining cases
    print(len(nps)-len(covered))

print(len(nps))

47504


Let's first eliminate the simple cases: phrase atoms that contain only a single head candidate.

In [20]:
simpleres = []

# assumes cand_sps, a set of candidate part-of-speech values, is already defined
for p in nps:
    cands = [w for w in L.d(p, 'word') if F.sp.v(w) in cand_sps]
    if len(cands) == 1:
        simpleres.append((p, cands[0]))
        
len(simpleres)

27766

In [9]:
def get_quantified(word, tf, **wsets):
    '''
    Recursively calls down a quantifier chain until
    finding a quantified word.
    '''
    
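    quants, noms = wsets['quants'], wsets['noms']

    P = Positions(word, 'phrase_atom', tf).get

    # conditions on the word at position n relative to `word`
    target = (
        lambda n: P(n),                # a word exists at this position
        lambda n: P(n) not in quants,  # it is not itself a quantifier
        lambda n: P(n) in noms,        # it is a nominal (noms holds word nodes)
    )

    # check this word
    if all(cond(0) for cond in target):
        return word
    
    # check next word
    elif all(cond(1) for cond in target):
        return P(1)
    
    # stop at the phrase_atom boundary (assumes P returns a falsy value there)
    elif not P(1):
        return None
    
    # move up one
    else:
        return get_quantified(P(1), tf, **wsets)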
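The recursion is easier to follow in isolation. The sketch below walks a plain list of invented node numbers instead of using `Positions`; the function name `quantified` and the toy sets are assumptions for illustration only.

```python
def quantified(index, words, quants, noms):
    """Walk rightwards from index until a non-quantifier nominal is found."""
    if index >= len(words):
        return None  # ran off the end of the phrase without a match
    word = words[index]
    if word in noms and word not in quants:
        return word
    return quantified(index + 1, words, quants, noms)

# toy phrase: two stacked quantifiers followed by the quantified noun
words = [10, 11, 12]
quants = {10, 11}
noms = {10, 11, 12}

print(quantified(0, words, quants, noms))  # 12
print(quantified(0, [10], quants, noms))   # None
```

Returning `None` at the end of the phrase gives the recursion a guaranteed stopping point, which matters for phrases that consist of quantifiers alone.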