# Going Subphraseless

The current method for isolating phrase heads ([here](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb)) requires strenuous and inelegant processing of BHSA subphrase relations. The subphrases are not always consistently encoded and suffer from numerous exceptional cases. As a result, the method is rather convoluted.

This notebook will explore the possibility of disconnecting semantic head analysis from the ETCBC subphrase encoding. 

A "semantic" head is the primary content word of a phrase, following Croft's "Primary Information Bearing Unit":

> **The noun and the verb are the PRIMARY INFORMATION-BEARING UNITS (PIBUs) of the phrase and clause respectively. In common parlance, they are the content words. PIBUs have major informational content that functional elements such as articles and [auxiliaries] do not have. (Croft, *Radical Construction Grammar*, 2001, 258; see also Shead, *Radical Frame Semantics and Biblical Hebrew*, 104)**

> **A (semantic) head is the profile equivalent that is the primary information-bearing unit, that is, the most contentful item that most closely profiles the same kind of thing that the whole constituent profiles. (ibid., 259)**

Croft also provides an additional criterion to "profile equivalence":

> **If the criterion of profile equivalence produces two candidates for headhood, the less schematic meaning is the PIBU; that is, the PIBU is the one with the narrower extension, in the formal semantic sense of that term (ibid., 259)**

## Inquiry

Can we isolate semantic phrase heads in BHSA using only the phrase_atom and phrase limits? This question means that we take the phrase_atom/phrase boundaries for granted. The validity of BHSA phrase boundaries still needs to be tested empirically. But for now, the exercise of isolating semantic phrase heads could be seen as the first step towards reproducible phrase boundaries.

## Basic Concepts

A semantic head will most often stand in a syntactically independent position. For Hebrew nominal phrases, that essentially means a word which is not preceded by a construct and which is semantically central, excluding words in attributive slots (e.g. H + noun + H + ATTRIBUTIVE) or adjectival slots (e.g. noun + noun as in אישׁ טוב).

Quantifier expressions present a unique case: they may be syntactically independent but semantically secondary. They are expressed through specialized lexical items such as cardinal numbers and qualitative quantifiers (e.g. "כל" and "חצי").

Another complication is the use of nouns as prepositional items. Such uses can be seen with words like פני "face," as in לפני "in front," and even words like ראשׁ, as in ראשׁ החדשׁ "beginning of the month."

Other expressions of quantity, quality, and function provide similar complexities. These cases have to be specified in advance.
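
The positional idea described above can be sketched on a toy phrase. This is only an illustration under stated assumptions: the `Word` record, the `head_candidates` function, and the sample glosses are hypothetical and are not part of BHSA or the `wordsets` module. A word counts as a head candidate when it is nominal, is not itself a quantifier or preposition, and is not governed by a preceding construct (constructs that are quantifiers or prepositions do not count as governors).

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    st: str = 'a'         # toy state values: 'c' = construct, 'a' = absolute
    quant: bool = False   # member of a (hypothetical) quantifier set
    prep: bool = False    # member of a (hypothetical) preposition set
    nominal: bool = True  # member of a (hypothetical) nominal set

def head_candidates(words):
    """Return the texts of words that could serve as semantic heads."""
    heads = []
    for i, w in enumerate(words):
        prev = words[i - 1] if i > 0 else None
        # functional items are never heads themselves
        if not w.nominal or w.quant or w.prep:
            continue
        # a preceding content word in construct state governs this word
        if prev is not None and prev.st == 'c' and not (prev.quant or prev.prep):
            continue
        heads.append(w.text)
    return heads

# toy phrase glossed "all of / days of / king"
phrase = [
    Word('all', st='c', quant=True),
    Word('days', st='c'),
    Word('king', st='a'),
]
candidates = head_candidates(phrase)
print(candidates)
```

In this toy case only "days" survives: "all" is a quantifier, and "king" is governed by the construct "days."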

### Ambiguity

Considerable ambiguity is present in several cases:

**`A B and C`**<br>
Given A, B, C == nominal words. Is their relationship `A // B // C` or `A+B // C`? In other words: **what is the relationship of two adjacent nominal words given a list?** Is B a descriptor of A or is it an independent element? 

**`A of B and C`**<br>
Is it `(A of B) // (C)` or `(A of (B // C))`?

Or even:

**`A of B C and D`**<br>
This pattern combines elements from both ambiguous cases.

To address these ambiguities we will apply a battery of disambiguation attempts. Some of those attempts will draw from corpus data, i.e. do we ever see `B and C` with the conjunction explicitly elsewhere in the corpus? Or do we ever see an `A of C` explicitly in the corpus? Accents may also play a role: do we see a conjunctive or disjunctive accent between `B C`? 
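
A minimal sketch of how two of these tests might be combined for an adjacent pair `A B` follows. The data are toy stand-ins, not corpus results, and the function name `adjacent_relation` and the accent labels `'conjunctive'`/`'disjunctive'` are assumptions made for the illustration.

```python
# toy stand-ins for corpus-derived data (not real BHSA values)
conj_pairs = {'silver': {'gold'}, 'gold': {'silver'}}
accent_type = {'man': 'conjunctive', 'silver': 'disjunctive'}

def adjacent_relation(a, b):
    """Guess how two adjacent nominals `a b` relate.

    Returns 'parallel' when the pair is attested with an explicit
    conjunction, or when `a` carries a disjunctive accent;
    otherwise returns 'dependent' (b describes a).
    """
    # corpus test: is `a and b` attested elsewhere?
    if a in conj_pairs.get(b, set()):
        return 'parallel'
    # accent test: a disjunctive accent on `a` separates it from `b`
    if accent_type.get(a) == 'disjunctive':
        return 'parallel'
    return 'dependent'

print(adjacent_relation('silver', 'gold'))
print(adjacent_relation('man', 'good'))
```

The order of the tests is itself a design choice: here the corpus evidence is consulted first and the accent only as a fallback.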

## Prerequisites

A number of pre-defined word sets are needed for processing quantification and ambiguous adjacency. These sets are made available in the form of `wsets`, a dictionary containing word sets that are calculated in the `wordsets` directory of this repository. The following wordsets have been defined:

* nominals – a set of word nodes with parts of speech and participles that have the potential to function as nominalized elements. The selected parts of speech are quite permissive: `{'subs', 'nmpr', 'adjv', 'advb', 'prde', 'prps', 'prin', 'inrg'}`. Since parts of speech are not taken as universal linguistic categories but only summaries of language-specific word tendencies (cf. Croft, *Radical Construction Grammar*, 2001), we consider that almost any part of speech can be used in a nominal pattern (or construction). There are some upper limits to this assumption, though. For instance, we exclude conjunctions, articles, prepositions, and negators. 
* prepositions – a word set consisting of words with a part of speech category of `prep`, a lexical set (`ls`) feature of `ppre` ("potential preposition"), as well as a select group of nouns like פני "face" which have been processed for prepositionality. 
* quantifiers – consists of word nodes that are cardinal numbers or qualitative quantifiers such as כל.
* mword – mapping from a word to its phonological word group ("masoretic word"); joins words on maqqeph and ø space
* accent_type – a mapping from a word to its accent type: conjunctive or disjunctive
* conj_pairs – a dict of observed conjunction pairings of lexemes in the corpus: `A & B`
* cons_pairs – a dict of observed construct pairings of lexemes in the corpus: `A of B`
* mom – mapping from word node to its mother word node for a specified relationship: `mom[A]['coord'] = B`
* kid – opposite of mom; mapping from word to its children nodes for a relationship: `kid[A]['cons'] = B`
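
To make the shape of the pair dictionaries concrete, here is a small sketch of how a mapping like `conj_pairs` could be assembled from observed pairings. The `build_pairs` function and the listed observations are hypothetical; the actual sets are computed in the `wordsets` directory and may be structured differently. Storing both directions is an assumption that suits symmetric relations like conjunction; a construct mapping (`A of B`) would presumably keep only one direction.

```python
import collections

def build_pairs(observations):
    """Map each lexeme to the set of lexemes it was observed with."""
    pairs = collections.defaultdict(set)
    for a, b in observations:
        pairs[a].add(b)
        pairs[b].add(a)  # symmetric storage, assumed for conjunction
    return dict(pairs)

# made-up observations using lexeme identifiers of the BHSA kind
observed = [('>JC/', 'BN/'), ('MLK/', 'BN/')]
pairs = build_pairs(observed)

print(sorted(pairs['BN/']))
```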

**Let's get started**. We load the necessary functions and BHSA data (straight from source).

In [1]:
import collections
import pickle
import random
import re
from IPython.display import display, HTML
from datetime import datetime
from pprint import pprint
from tf.app import use
with open('wordsets/wsets.pickle', 'rb') as infile:
    wsets = pickle.load(infile)
A = use('bhsa', hoist=globals(), silent=True)

   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


### Wordsets

In [2]:
wsets.keys()

dict_keys(['noms', 'preps', 'quants', 'accent_type', 'mwords', 'conj_pairs', 'cons_pairs'])

In [3]:
list(wsets['cons_pairs'].keys())[:10]

['>JC/', 'KL/', 'BN/', 'TPF[', 'MZBX/', 'B>R/', '<T/', 'XRB/', 'RWX/', 'MLK/']

In [4]:
#wsets['conj_pairs']['>JC/']

# Machinery

We could use some machinery to do the hard work of looking in and around a node. In the older approach we used TF search templates. But these are not very efficient at scale, and they are always bound by the limits of the query language. I take another approach here: a class which, when called, can give all sorts of contextual information about any node which is fed in. This class is related to the `Mom` class used in wordsets ([here](wordsets/context.py)), and in fact it depends on the output of that class which is stored in the `mom` and `kid` word sets.

The "machines" we use are defined in the positions module. They are series of position testers for a node and string evaluators. By writing our conditions as strings, we can maintain a mapping between the truth conditions and their results. This is intended to aid debugging and the process of elimination which we apply herein.
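
The idea of string-labelled conditions can be illustrated without Text-Fabric. The sketch below is a toy stand-in for the real `Positions` class: `make_getter` and the sample words are hypothetical, and words are plain dicts rather than BHSA nodes. The getter returns `None` for positions outside the phrase, so a test on a missing neighbor simply fails instead of raising.

```python
def make_getter(words, index):
    """Return a function P(offset, feature=None) relative to words[index]."""
    def P(offset, feature=None):
        i = index + offset
        if i < 0 or i >= len(words):
            return None  # no word at that position
        word = words[i]
        return word if feature is None else word.get(feature)
    return P

# toy phrase glossed "house of / the / king"
words = [
    {'text': 'house', 'sp': 'subs', 'st': 'c'},
    {'text': 'the', 'sp': 'art', 'st': 'NA'},
    {'text': 'king', 'sp': 'subs', 'st': 'a'},
]
P = make_getter(words, 2)  # look around "king"

# each condition is keyed by its own source text,
# so a failed rule can be read off directly
conds = {
    'P(-1,sp) == art': P(-1, 'sp') == 'art',
    'P(-2,st) == c': P(-2, 'st') == 'c',
    'P(-3) is None': P(-3) is None,
}

for label, value in conds.items():
    print(f'{label:<20} {value}')
```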

In [5]:
from wordsets.positions import Getter, Positions, Evaluator, showconds

### Search Cache For Keeping Queries

In [6]:
run = input('refresh cache?')

# reset the cache on request, or create it on the first run
if run == 'y' or 'cache' not in globals():
    cache = []

refresh cache? y


In [86]:
def prettyconds(rulesets):
    '''
    Iterate through an explain dict for a rela
    and print out all of checked conditions.
    '''
    
    for ruleset in rulesets:
        name, src, tgt = ruleset['name'], ruleset['src'], ruleset['tgt']
        print(name)
        print(f'{src} -> {tgt}')
        
        for cond, value in ruleset['cnd'].items():
            print('{:<30} {:>30}'.format(cond, str(value)))
        
        print()
        
def showmatch(match):
    
    '''
    Displays a match from a Grammar test.
    '''
    
    hit, conds = match['match'], match['conds']
    
    if not hit:
        print('NO MATCHES')
        print('-'*20)
        A.pretty(L.u(conds[0]['src'], 'phrase_atom')[0], extraFeatures='sp st', withNodes=True)
        prettyconds(conds)
        return None

    name, src, tgt = hit['name'], hit['src'], hit['tgt']

    phrase = L.u(src, 'phrase_atom')[0]

    highlights = {src:'pink',
                  tgt:'lightgreen'}

    A.pretty(phrase, withNodes=True, extraFeatures='sp st', highlights=highlights)
    prettyconds(conds)
    display(HTML('<hr>'))
        
def test_search(relastr, show=10, end=None, name='', phrases=None):
    '''
    Searches phrases with the specified relation 
    and prints out their descriptive explanation.
    '''
    
    start = datetime.now()
    print('beginning search')
    
    # build a convenient test set of words
    phrases = phrases or [ph for ph in F.otype.s('phrase_atom') 
                   if F.typ.v(ph) in {'NP', 'PP'}
                   and len(L.d(ph, 'word')) > 1
              ]
    words = [w for ph in phrases for w in L.d(ph, 'word')
                if w in wsets['noms']
                and F.language.v(w) == 'Hebrew'
            ]
    
    # random shuffle to get good diversity of examples
    random.shuffle(words)
    
    # set up grammar
    G = Grammar(wsets, A)
    
    matches = []
    append = matches.append
    
    # iterate and find matches on words
    for i,w in enumerate(words):

        # update every 5000 iterations
        if i%5000 == 0:
            print(f'\t{len(matches)} found ({i}/{len(words)})')
        
        # run grammar search
        test = G.tests[relastr](w)
        
        # save results
        if test['match']:
            if not name:
                append(test)
            elif test['match']['name'] == name:
                append(test)
            
        # stop at end
        if len(matches) == end:
            break
        
        
    # display
    print('done at', datetime.now() - start)
    print(len(matches), 'matches found...')
    print('showing', min(show, len(matches)))
    
    for match in matches[:show]:
        showmatch(match)
        
    cache.append(matches)

## Grammar

This class contains patterns needed to classify constructions in Hebrew phrases.

In [130]:
class Grammar:
    
    def __init__(self, wsets, tf):
        self.tf = tf
        self.wsets = wsets
        self.tests = {
            'cons': self.cons,
            'adjv': self.adjv,
            'disj': self.disj,
            'quant': self.quant,
        }
        # cache results for recursive calls?
        self.cache = collections.defaultdict(lambda:collections.defaultdict())
        
    def getP(self, w):
        return Positions(w, 'phrase_atom', self.tf).get
    
    def test(self, condtuple):
        test = [ruleset for ruleset in condtuple
                    if all(ruleset['cnd'].values())
               ]
        
        # NB rules are written so as to default
        # on the last-most matching item
        # this allows for more complex situations
        # to default if true
        if test:
            return test[-1]
        else:
            return {}
        
    def everymatch(self, w):
        '''
        Runs analysis for all tests on a word.
        Returns as dict of test:result.
        '''
        results = {}
        
        for name, test in self.tests.items():
            res = test(w)
            if res['match']:
                results[name] = res['match']
                
        return results
        
    def cons(self, w):
        
        '''
        Queries for construct relations
        on a word.
        '''
        
        params = {'match': {}, 'conds': None}
        
        if not w:
            return params
        
        P = self.getP(w)
        wsets = self.wsets
        quants, preps, noms = wsets['quants'], wsets['preps'], wsets['noms']
        
        params['conds'] = (

            {
                'name': 'cons',
                'src': P(0),
                'tgt': P(-1),
                'cnd': {

                    'P(-1, st) == c': 
                        P(-1,'st') == 'c',

                    'P(-1) not in quants|preps': 
                        P(-1) not in quants|preps,
                }
            },

            {
                'name': 'cons',
                'src': P(0),
                'tgt': P(-2),
                'cnd': {

                    'P(-1,sp) == art': 
                        P(-1,'sp') == 'art',

                    'P(-2, st) == c': 
                        P(-2,'st') == 'c',

                    'P(-2) not in quants|preps': 
                        P(-2) not in quants|preps,
                }
            }
        )

        params['match'] = self.test(params['conds'])
        return params

    def adjv(self, w):
        
        '''
        Searches for adjective relas on a word.
        '''
                
        params = {'match': {}, 'conds': None}
        
        if not w:
            return params
        
        P = self.getP(w)
        wsets = self.wsets
        quants, preps, noms = wsets['quants'], wsets['preps'], wsets['noms']
        
        # check for recursive adjective matches 
        a2match = self.adjv(P(-1))
        
        common = {
            
            'not self.disj(P(0))[match]':
                not self.disj(P(0))['match'],
            
            'P(-1) in noms':
                P(-1) in noms,
            
            'P(-1, st) in {NA, a}': 
                P(-1,'st') in {'NA', 'a'},   
            
            'P(-1) not in {quants|preps}':
                P(-1) not in quants|preps,
        }
                
        params['conds'] = (
            
            {
                'name': 'adjv',
                'src': P(0),
                'tgt': P(-1),
                'cnd': dict(common, **{

                    'P(0, sp) in {adjv, verb}':
                        P(0,'sp') in {'adjv', 'verb'},

                })
            },
            
            {
                'name': 'adjv adjv',
                'src': P(0),
                'tgt': a2match['match'].get('tgt', None),
                'cnd': dict(common, **{
                    
                    'P(0,sp) in {adjv, verb}':
                        P(0,'sp') in {'adjv', 'verb'},
                    
                     'self.adjv(P(-1)) and target != P(0)':
                        bool(a2match['match']) and a2match['match'].get('tgt', None) != P(0)
                })
            },
            
           {
                'name': 'adverb',
                'src': P(0),
                'tgt': P(1),
                'cnd': {

                    'P(0,sp) == advb':
                        P(0,'sp') == 'advb',

                    'P(1) in noms':
                        P(1) in noms,
                }
            },
            
            {
                'name': 'attrib pattern',
                'src': P(0),
                'tgt': P(-2),
                'cnd': {
                    
                    'P(0) in noms':
                        P(0) in noms,
                    
                    'P(-1,sp) == art':
                        P(-1,'sp') == 'art',
                    
                    'P(-2) in noms':
                        P(-2) in noms,
                    
                    'P(-2) not in quants':
                        P(-2) not in quants,
                    
                    'P(-2,st) in {NA, a}':
                        P(-2,'st') in {'NA', 'a'},
                    
                    'P(-2,sp) != advb':
                        P(-2,'sp') != 'advb',
                    
                }
                
            }

        )

        params['match'] = self.test(params['conds'])
        return params
        
    def disj(self, w):
        
        '''
        Queries for disjunct patterns, wherein A || B
        and B does not connect to A 
        A number of constructions are covered by this function:
            * lists - parallel items in a list
            * apposition - parallel items where B further specifies A
        '''
        
        params = {'match': {}, 'conds': None}
        
        if not w:
            return params
        
        P = self.getP(w)
        wsets = self.wsets
        quants, preps, noms = wsets['quants'], wsets['preps'], wsets['noms']
        conjpairs = wsets['conj_pairs']
        atype = wsets['accent_type']
        mword = wsets['mwords']
        
        # conditions common to several parameters
        common = {
            
           'P(0) in noms':
                P(0) in noms,

            'not {P(0), P(-1)} & quants':
                not {P(0), P(-1)} & quants,

            'P(-1) in noms':
                P(-1) in noms,

            'P(-1, st) == a': 
                P(-1,'st') == 'a',
            
        }
        
        # build the conditions
        params['conds'] = (

            {
                'name': 'disjoint conjpair',
                'src': P(0),
                'tgt': P(-1),
                'cnd': dict(common, **{
                                        
                    'P(-1,lex) in conjpairs[P(0,lex)]':
                        P(-1,'lex') in conjpairs.get(P(0,'lex'), ()),
                })
            },
            
            {
                'name': 'disjoint accent',
                'src': P(0),
                'tgt': P(-1),
                'cnd': dict(common, **{
                    
                    'P(-1) == disjunct and P(-1) not in mword':
                        atype.get(P(-1)) == 'disjunct' and P(-1) not in mword[w],
                })
            },
        )

        params['match'] = self.test(params['conds'])
        return params
        
    def quant(self, w):
        
        params = {'match': {}, 'conds': None}
        
        if not w:
            return params
        
        P = self.getP(w)
        wsets = self.wsets
        quants, noms = wsets['quants'], wsets['noms']
        get_quanted = self.get_quantified(w)
        
        qqbackward = self.quant(P(-1))
        
        common = {
            'P(0) in quants':
                P(0) in quants,
        }
        
        params['conds'] = (
        
            {
                'name': 'quant forward',
                'src': P(0),
                'tgt': get_quanted,
                'cnd': dict(common, **{
                    
                    'bool(get_quanted)':
                        bool(get_quanted)
                })
            },
            
            {
                'name': 'quant backward',
                'src': P(0),
                'tgt': P(-1),
                'cnd': dict(common, **{
                
                    'not bool(get_quanted)':
                        not bool(get_quanted),
                    
                    'P(-1,st) in {NA, a}':
                        P(-1,'st') in {'NA', 'a'},
                    
                    'P(-1) in noms':
                        P(-1) in noms,
                    
                    'P(-1) not in quants':
                        P(-1) not in quants
                    
                })
            },
            
            {
                'name': 'quant backward w/ article',
                'src': P(0),
                'tgt': P(-2),
                'cnd': dict(common, **{
                
                    'not bool(get_quanted)':
                        not bool(get_quanted),
                    
                    'P(-1, sp) == art':
                        P(-1,'sp') == 'art',
                    
                    'P(-2,st) in {NA, a}':
                        P(-2,'st') in {'NA', 'a'},
                    
                    'P(-2) in noms':
                        P(-2) in noms,
                    
                    'P(-2) not in quants':
                        P(-2) not in quants
                    
                })
            },
        
            {
                'name': 'quant quant backward',
                'src': P(0),
                'tgt': qqbackward['match'].get('tgt', None),
                'cnd': dict(common, **{
                
                    'not bool(get_quanted)':
                        not bool(get_quanted),
                    
                    'bool(qqbackward[match])':
                        bool(qqbackward['match']),
                })
            },
            
        
        )
        
        params['match'] = self.test(params['conds'])
        return params
    
    def get_quantified(self, word):
        '''
        Locates the first non-quant nominal word
        ahead of a quantifier.
        '''

        if not word:
            return None
        
        quants, noms = self.wsets['quants'], self.wsets['noms']
        
        P = self.getP(word)

        conds = (
            P(0), 
            P(0) not in quants,
            P(0) in noms,
        )

        # check this word
        if all(conds):
            return word

        # move up one
        else:
            return self.get_quantified(P(1))
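
The selection logic in `Grammar.test` above can be isolated in a few lines: of all rulesets whose conditions hold, the last one wins, so more specific rules should be listed after more general ones. The sketch below restates that logic as a standalone function; `pick_rule` and the toy rulesets are illustrative names only.

```python
def pick_rule(rulesets):
    """Return the last ruleset whose conditions are all true, else {}."""
    passed = [r for r in rulesets if all(r['cnd'].values())]
    return passed[-1] if passed else {}

# toy rulesets, ordered from general to specific
rulesets = (
    {'name': 'general', 'cnd': {'is nominal': True}},
    {'name': 'specific', 'cnd': {'is nominal': True, 'follows article': True}},
    {'name': 'other', 'cnd': {'is quantifier': False}},
)

chosen = pick_rule(rulesets)
print(chosen['name'])
```

Both "general" and "specific" pass here, and the later, more specific rule is returned.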

# Need for Semantic Data

The accurate processing of word connections depends on fuller semantic data, which should be stored in word sets (`wsets`). 

For example, in the two phrases

> (Exod 25:39) ככר זהב טהור <br>
> (2 Sam 24:24) בכסף שקלים חמשׁים

we see that זהב and כסף, despite standing in two different positions alongside two different words, both indicate a kind of "composed of" semantic concept: "round gold" (i.e. round composed of gold) and "silver shekels" (shekels composed of silver). To process these kinds of links, we need a list of nouns that often function as "material".
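
As a rough sketch of what such a list would make possible, the snippet below links a material noun to a neighboring word regardless of which side it stands on. Everything here is hypothetical: the `materials` set contains only the two nouns from the examples, `material_links` is an illustrative name, and the phrases are reduced to bare consonantal strings (with the preposition ב removed) rather than BHSA nodes.

```python
# hypothetical word set; a real list would have to be compiled from the corpus
materials = {'זהב', 'כסף'}

def material_links(words):
    """Pair each material noun with an adjacent non-material word."""
    links = []
    for i, w in enumerate(words):
        if w not in materials:
            continue
        # prefer the preceding word, else fall back on the following one
        neighbors = [words[j] for j in (i - 1, i + 1) if 0 <= j < len(words)]
        for n in neighbors:
            if n not in materials:
                links.append((w, n))
                break
    return links

ex1 = ['ככר', 'זהב', 'טהור']     # Exod 25:39
ex2 = ['כסף', 'שקלים', 'חמשׁים']  # 2 Sam 24:24, without the preposition
links1 = material_links(ex1)
links2 = material_links(ex2)
print(links1, links2)
```

A fuller version would also need to check that the neighbor is a noun at all, which this toy omits.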

## A Compromise: Time Phrases

Since constructing these semantic classes is very time-consuming, I want to start with a smaller set of cases. For now I will focus on parsing connections within time phrases, since I am analyzing time phrases in my ongoing PhD project.

In [120]:
G = Grammar(wsets, A)

In [121]:
G.everymatch(2474)

{'quant': {'name': 'quant forward',
  'src': 2474,
  'tgt': 2475,
  'cnd': {'P(0) in quants': True, 'bool(get_quanted)': True}}}

In [111]:
showmatch(G.tests['quant'](2474))

quant forward
2474 -> 2475
P(0) in quants                                           True
bool(get_quanted)                                        True

quant backward
2474 -> 2473
P(0) in quants                                           True
not bool(get_quanted)                                   False
P(-1,st) in {NA, a}                                      True
P(-1) in noms                                           False
P(-1) not in quants                                      True

quant backward w/ article
2474 -> 2472
P(0) in quants                                           True
not bool(get_quanted)                                   False
P(-1, sp) == art                                        False
P(-2,st) in {NA, a}                                      True
P(-2) in noms                                            True
P(-1) not in quants                                      True

quant quant backward
2474 -> None
P(0) in quants                                           True
n

In [107]:
#G.tests['quant'](195863)

## Testing

In [97]:
phrases = [ph for ph in F.otype.s('phrase_atom') 
                   if F.typ.v(ph) in {'NP', 'PP'}
                   and F.function.v(L.u(ph, 'phrase')[0]) == 'Time'
                   and len(L.d(ph, 'word')) > 1
            ]

In [129]:
test_search('quant', name='', show=100, end=1, phrases=phrases)

beginning search
	0 found (0/6606)
done at 0:00:00.045571
1 matches found...
showing 1


quant forward
175608 -> 175609
P(0) in quants                                           True
bool(get_quanted)                                        True

quant backward
175608 -> 175607
P(0) in quants                                           True
not bool(get_quanted)                                   False
P(-1,st) in {NA, a}                                     False
P(-1) in noms                                           False
P(-1) not in quants                                      True

quant backward w/ article
175608 -> 175606
P(0) in quants                                           True
not bool(get_quanted)                                   False
P(-1, sp) == art                                        False
P(-2,st) in {NA, a}                                      True
P(-2) in noms                                           False
P(-1) not in quants                                      True

quant quant backward
175608 -> None
P(0) in quants                                   

## Map Relations on Timephrase Words

These relations will be used to calculate heads and export edges.
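
One possible way to get from a relation mapping to heads and edges is sketched below. It rests on an assumption not stated in this notebook: that a word which is the source of some relation is a dependent, so the heads are the nominal words that are never a source. The function `heads_and_edges` and the small integer node numbers are toy values.

```python
def heads_and_edges(noms, relas):
    """Derive head words and (source, target, relation) edges.

    noms  -- nominal word nodes of a phrase, in order
    relas -- dict of source word -> {relation name: target word}
    """
    edges = [
        (src, tgt, name)
        for src, rels in relas.items()
        for name, tgt in rels.items()
    ]
    # assumption: any word that points at another word is not a head
    heads = [w for w in noms if w not in relas]
    return heads, edges

# toy phrase: word 1 quantifies word 2, word 3 stands alone
toy_noms = [1, 2, 3]
toy_relas = {1: {'quant': 2}}
heads, edges = heads_and_edges(toy_noms, toy_relas)
print(heads, edges)
```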

In [122]:
relas = {}

phrases = [ph for ph in F.otype.s('phrase_atom')
              if F.function.v(L.u(ph,'phrase')[0]) == 'Time'
          ]

for phrase in phrases:
    noms = [w for w in L.d(phrase, 'word') if w in wsets['noms']]
    
    if not noms:
        raise Exception(T.text(phrase), phrase)

Exception: ('טֶ֚רֶם ', 905188)

In [123]:
L.d(905188,'word')

(750,)

In [128]:
F.ls.v(750)

'ppre'

In [127]:
750 in wsets['preps']

True