# Going Subphraseless

The current method for isolating phrase heads ([here](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb)) requires strenuous and inelegant processing of BHSA subphrase relations. The subphrases are not always consistently encoded and suffer from numerous exceptional cases. As a result, the method is rather convoluted.

This notebook will explore the possibility of disconnecting semantic head analysis from the ETCBC subphrase encoding. 

A "semantic" head is the primary content word of a phrase, following Croft's "Primary Information Bearing Unit":

> **The noun and the verb are the PRIMARY INFORMATION-BEARING UNITS (PIBUs) of the phrase and clause respectively. In common parlance, they are the content words. PIBUs have major informational content that functional elements such as articles and [auxiliaries] do not have. (Croft, *Radical Construction Grammar*, 2001, 258; see also Shead, *Radical Frame Semantics and Biblical Hebrew*, 104)**

> **A (semantic) head is the profile equivalent that is the primary information-bearing unit, that is, the most contentful item that most closely profiles the same kind of thing that the whole constituent profiles. (ibid., 259)**

Croft also provides an additional criterion to "profile equivalence":

> **If the criterion of profile equivalence produces two candidates for headhood, the less schematic meaning is the PIBU; that is, the PIBU is the one with the narrower extension, in the formal semantic sense of that term (ibid., 259)**

## Inquiry

Can we isolate semantic phrase heads in BHSA using only the phrase_atom and phrase limits? This question means that we take the phrase_atom/phrase boundaries for granted. Empirically, the validity of BHSA phrase boundaries still needs to be tested. But for now, the exercise of isolating semantic phrase heads can be seen as a first step towards reproducible phrase boundaries.

## Basic Concepts

A semantic head will most often stand in a syntactically independent position. For Hebrew nominal phrases, that essentially means a word which is not preceded by a construct and which is semantically central, i.e. not in an attributive slot (e.g. H + noun + H + ATTRIBUTIVE) or an adjectival slot (e.g. noun + noun as in אישׁ טוב).

Quantifier expressions present unique cases: they may be syntactically independent but semantically secondary. These are expressed through specialized lexical items such as cardinal numbers and qualitative quantifiers (e.g. "כל" and "חצי").

Another complication is the use of nouns as prepositional items. Such uses can be seen with words like פני "face," as in לפני "in front," and even words like ראשׁ, as in ראשׁ החדשׁ "beginning of the month."

Other expressions of quantity, quality, and function provide similar complexities. These cases have to be specified in advance.

### Ambiguity

Considerable ambiguity is present in several cases:

**`A B and C`**<br>
Given A, B, C == nominal words, is their relationship `A // B // C` or `A+B // C`? In other words: **what is the relationship of two adjacent nominal words given a list?** Is B a descriptor of A or is it an independent element?

**`A of B and C`**<br>
Is it `(A of B) // (C)` or `A of (B // C)`?

Or even:

**`A of B C and D`**<br>
This pattern combines elements from both ambiguous cases.

To address these ambiguities we will apply a battery of disambiguation attempts. Some of those attempts will draw on corpus data: do we ever see `B and C` with an explicit conjunction elsewhere in the corpus? Do we ever see `A of C` explicitly in the corpus? Accents may also play a role: do we see a conjunctive or disjunctive accent between `B C`?
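The corpus-based checks described above can be sketched with toy data. This is only an illustration: the lexemes `A`, `B`, `C` are placeholders, the assumed shape of the pairing dicts (lexeme mapped to a set of lexemes) is a guess at how `conj_pairs` and `cons_pairs` might be stored, and the helper names `is_attested`, `read_adjacent`, and `read_construct` are hypothetical. Note that a missing attestation is weak evidence, so the fallback readings are defaults rather than conclusions.

```python
# Toy stand-ins for corpus pairing data (assumed shape, not the real wsets).
# Each dict maps a lexeme to the set of lexemes observed with it.
conj_pairs = {'A': {'B'}}   # `A and B` is attested somewhere in the corpus
cons_pairs = {'A': {'B'}}   # `A of B` is attested; `A of C` is not


def is_attested(pairs, first, second):
    """Return True if `first` is recorded together with `second`."""
    return second in pairs.get(first, set())


def read_adjacent(a, b, conj_pairs):
    """Pick a reading for `A B and C` from conjunction attestations.

    Coordination is treated as symmetric, so both orders are checked.
    """
    if is_attested(conj_pairs, a, b) or is_attested(conj_pairs, b, a):
        return 'A // B // C'   # B looks like an independent list member
    return 'A+B // C'          # default: B describes A


def read_construct(a, c, cons_pairs):
    """Pick a reading for `A of B and C` from construct attestations.

    Construct relations are directional, so only `A of C` is checked.
    """
    if is_attested(cons_pairs, a, c):
        return 'A of (B // C)'
    return '(A of B) // C'     # default: the construct stops at B


print(read_adjacent('A', 'B', conj_pairs))   # A // B // C
print(read_construct('A', 'C', cons_pairs))  # (A of B) // C
```

Accent data could be added as a further tie-breaker in the same style, but that is left out here because the source does not specify how accents should be weighed.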

## Prerequisites

A number of pre-defined word sets are needed for processing quantification and ambiguous adjacency. These sets are made available in the form of `wsets`, a dictionary of word sets that are calculated in the `wordsets` directory of this repository. The following wordsets have been defined:

* nominals (`noms`) – a set of word nodes with parts of speech and participles that have the potential to function as nominalized elements. The selected parts of speech are quite permissive: `{'subs', 'nmpr', 'adjv', 'advb', 'prde', 'prps', 'prin', 'inrg'}`. Since parts of speech are not taken as universal linguistic categories but only summaries of language-specific word tendencies (cf. Croft, *Radical Construction Grammar*, 2001), we consider that almost any part of speech can be used in a nominal pattern (or construction). There are some upper limits to this assumption, though. For instance, we exclude conjunctions, articles, prepositions, and negators.
* prepositions (`preps`) – a word set consisting of words with a part of speech category of `prep`, a lexical set (`ls`) feature of `ppre` ("potential preposition"), as well as a select group of nouns like פני "face" which have been processed for prepositionality.
* quantifiers (`quants`) – consists of word nodes that are cardinal numbers or qualitative quantifiers such as כל.
* mwords – mapping from a word to its phonological word group ("masoretic word"); joins words on maqqeph and ø space
* accent_type – a mapping from a word to its accent type: conjunctive or disjunctive
* conj_pairs – a dict of observed conjunction pairings of lexemes in the corpus: `A & B`
* cons_pairs – a dict of observed construct pairings of lexemes in the corpus: `A of B`
* mom – mapping from word node to its mother word node for a specified relationship: `mom[A]['coord'] = B`
* kid – opposite of mom; mapping from word to its children nodes for a relationship: `kid[A]['cons'] = B`
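To make the relationship between `mom` and `kid` concrete, here is a minimal sketch that derives one mapping from the other. The node numbers and relations are invented for illustration, the helper name `invert` is hypothetical, and storing children as a list is an assumption (the description above only says that `kid` maps a word to its children for a relationship).

```python
import collections

# Hypothetical word nodes and relations; real values would come from the corpus.
mom = {
    2: {'cons': 1},    # word 2 stands in a construct relation to word 1
    4: {'coord': 2},   # word 4 is coordinated with word 2
    5: {'coord': 2},   # word 5 is coordinated with word 2
}


def invert(mom):
    """Build a kid mapping (mother -> relation -> children) from a mom mapping."""
    kid = collections.defaultdict(lambda: collections.defaultdict(list))
    for child, relations in mom.items():
        for rela, mother in relations.items():
            kid[mother][rela].append(child)
    # convert the nested defaultdicts to plain dicts for easy inspection
    return {mother: dict(relas) for mother, relas in kid.items()}


kid = invert(mom)
print(kid)  # {1: {'cons': [2]}, 2: {'coord': [4, 5]}}
```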

**Let's get started**. We load the necessary functions and BHSA data (straight from source).

In [61]:
import sys
import collections
import pickle
import random
import re
import itertools
import copy
from IPython.display import display, HTML
from datetime import datetime
from pprint import pprint
from tf.app import use
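# load the pre-computed word sets; the context manager closes the file afterwards
with open('wordsets/wsets.pickle', 'rb') as infile:
    wsets = pickle.load(infile)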
A = use('bhsa', hoist=globals(), silent=True)
A.displaySetup(condenseType='phrase', withNodes=True, extraFeatures='st')

   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


### Wordsets

In [2]:
wsets.keys()

dict_keys(['noms', 'preps', 'quants', 'accent_type', 'mwords', 'conj_pairs', 'cons_pairs'])

In [3]:
list(wsets['cons_pairs'].keys())[:10]

['>JC/', 'KL/', 'BN/', 'TPF[', 'MZBX/', 'B>R/', '<T/', 'XRB/', 'RWX/', 'MLK/']

In [4]:
#wsets['conj_pairs']['>JC/']

# Machinery

We could use some machinery to do the hard work of looking in and around a node. In the older approach we used TF search templates. But these are not very efficient at scale, and they are always bound by the limits of the query language. I take another approach here: a set of classes that specify locations and directions within a specified context.

In [5]:
from wordsets.langtools import Positions, Walker, Dummy

## `Positions`

The `Positions` class enables concise access to adjacent nodes within a given context. This allows us to write algorithms with query-like efficiency with all of the power of Python. 

This class is instantiated on a word node and can provide contextual look-up data for a given word. For example, given a phrase containing the following word nodes:

> (189681, 189682, **189683**, 189684, 189685, 189686) <br>

representing the following phrase (space separated for clarity):

> ב שׁנת **שׁלשׁים** ו שׁמנה שׁנה

Given that the bolded node, `189683`, is our `source` word, we instantiate the class, feeding in the node, the "phrase_atom" string (which is the context we want to search within), and an instance of Text-Fabric (`tf`):

In [6]:
      #    source node    context  TF instance  
      #         |            |       |
P = Positions(189683, 'phrase_atom', A).get

If we want to obtain the word adjacent one space forward, we simply ask `P` for `1`, which gives us the next word in the phrase.

In [7]:
P(1)

189684

If we try to ask for 4 words forward, we go beyond the bounds of the phrase. But `P` handles this by returning nothing:

In [8]:
P(4)

To look back one word, we simply give a negative value:

In [9]:
P(-1)

189682

Finally, `P` can be used to quickly call features on these words. For instance, in order to get the lexeme of the word two positions ahead of `189683`:

In [10]:
P(2,'lex')

'CMNH/'

And if we want to get a number of features, we can just add other features to the arguments. The result is a feature set:

In [11]:
P(2, 'lex', 'nu')

{'CMNH/', 'sg'}

`P` can also handle features on the source node itself by giving a positionality of `0`:

In [12]:
P(0, 'lex')

'CLC/'
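The behaviour shown in the cells above can be imitated with a small stand-in class. This is not the implementation in `wordsets.langtools`; it is a sketch that assumes the context is simply an ordered list of nodes and that features are plain dicts. The class name `ToyPositions` is hypothetical, and the feature values are limited to the ones visible in the outputs above.

```python
class ToyPositions:
    """Look up neighbours of a source node inside an ordered context."""

    def __init__(self, source, context_nodes, features):
        self.nodes = list(context_nodes)
        self.origin = self.nodes.index(source)
        self.features = features  # feature name -> {node: value}

    def get(self, offset, *feats):
        """Return the node at `offset` from the source, or its feature value(s).

        Returns None when the offset falls outside the context.
        """
        i = self.origin + offset
        if not 0 <= i < len(self.nodes):
            return None
        node = self.nodes[i]
        if not feats:
            return node
        values = [self.features[f].get(node) for f in feats]
        # one feature gives a single value, several give a set
        return values[0] if len(values) == 1 else set(values)


# the phrase atom from the example above, with only the features shown there
nodes = range(189681, 189687)
features = {
    'lex': {189683: 'CLC/', 189685: 'CMNH/'},
    'nu': {189685: 'sg'},
}
TP = ToyPositions(189683, nodes, features).get

print(TP(1))          # 189684
print(TP(4))          # None (beyond the phrase atom)
print(TP(2, 'lex'))   # CMNH/
```

Returning `None` for out-of-bounds offsets is what lets conditions be written without explicit boundary checks.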

## `Walker`

`Walker` performs a similar function to `Positions`, except that it is agnostic about exact positions: it walks either `ahead` or `back` from the source until it reaches a target node in the context. A function must be supplied that returns `True` on the target node.

We instantiate the `Walker` using the same source and context as above.

In [13]:
      #  source node    context  TF instance  
      #       |            |       |
Wk = Walker(189683, 'phrase_atom', A)

`Walker` is demonstrated below with the same word. A simple `lambda` function is used to test the lexical set (`ls`) feature. In the example below, we find the first word ahead of `189683` that is a cardinal number:

In [14]:
Wk.ahead(lambda w: F.ls.v(w) == 'card')

189685

An alternative demonstrates the `None` returned on the lack of a valid match.

In [15]:
Wk.ahead(lambda w: F.ls.v(w) == 'BOOGABOOGA')

Another example wherein we walk backwards to the preposition:

In [16]:
Wk.back(lambda w: F.sp.v(w) == 'prep')

189681

We can also specify that the walk should be interrupted under certain conditions with a `stop` function. In this case we walk forward to the next cardinal number, but the walk is interrupted when the `stop` function detects a conjunction.

In [17]:
Wk.ahead(lambda w: F.ls.v(w) == 'card',
         stop=lambda w: F.sp.v(w) == 'conj')

We can also specify the opposite with a `go` function argument, which defines the nodes that are allowed to intervene between `source` and `target`. Below we specify that *only* a conjunction should intervene.

In [18]:
Wk.ahead(lambda w: F.ls.v(w) == 'card',
         go=lambda w: F.sp.v(w) == 'conj')

189685

The `go` and `stop` functions can be as permissive or strict as desired.

Finally, we can tell `Walker` that the output of the validation function should be returned instead of the node itself with the optional argument `output=True`:

In [19]:
val_funct = lambda w: F.ls.v(w) if F.ls.v(w)=='card' else None

Wk.ahead(val_funct, output=True)

'card'

This ability is useful for certain tests.
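The walking logic demonstrated in this section can be approximated as follows. This is a sketch under assumptions, not the code in `wordsets.langtools`: the class name `ToyWalker` is hypothetical, and the order of checks (target first, then `stop`, then `go`) is inferred from the outputs above. The toy feature dicts contain only values that appear in those outputs.

```python
class ToyWalker:
    """Walk from a source node through an ordered context to a target."""

    def __init__(self, source, context_nodes):
        nodes = list(context_nodes)
        i = nodes.index(source)
        self._ahead = nodes[i + 1:]      # nodes after the source
        self._back = nodes[:i][::-1]     # nodes before the source, nearest first

    def _walk(self, path, is_target, stop, go, output):
        for node in path:
            result = is_target(node)
            if result:
                return result if output else node
            if stop is not None and stop(node):
                return None              # an interrupting node was found
            if go is not None and not go(node):
                return None              # node is not allowed to intervene
        return None                      # ran out of context

    def ahead(self, is_target, stop=None, go=None, output=False):
        return self._walk(self._ahead, is_target, stop, go, output)

    def back(self, is_target, stop=None, go=None, output=False):
        return self._walk(self._back, is_target, stop, go, output)


# feature values taken from the outputs shown in this notebook
sp = {189681: 'prep', 189684: 'conj'}
ls = {189683: 'card', 189685: 'card'}

TW = ToyWalker(189683, range(189681, 189687))
is_card = lambda n: ls.get(n) == 'card'
is_conj = lambda n: sp.get(n) == 'conj'

print(TW.ahead(is_card))                 # 189685
print(TW.ahead(is_card, stop=is_conj))   # None
print(TW.ahead(is_card, go=is_conj))     # 189685
```

Checking the target before `go` matters: otherwise the target itself, which is not a conjunction, would end the walk in the last call.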

## `Dummy`

When writing conditions and logic, we want an object that passively receives `NoneType`s or zero `int`s without throwing errors. Such an object should also return `None` to reflect its `False` value. `Dummy` provides such functionality. `Dummy` can receive the same arguments, kwargs, and function calls as a `Positions` or `Walker` object. But it returns absolutely nothing. Ouch.

In [20]:
D = Dummy(None, 'phrase_atom', A)

The function call below returns `None`:

In [21]:
D.get(1)

As does this:

In [22]:
D.get(1, 'lex')

And even this:

In [23]:
D.ahead(1)

`D` is essentially a soulless void that consumes whatever you throw at it and gives nothing in return.

For safe calls on a `Positions` or `Walker` object, assign nodes to it via a function that returns a `Dummy` for null nodes:

In [24]:
def getPos(node, context, tf):
    """A function to get Positions safely."""
    if node:
        return Positions(node, context, tf)
    else:
        return Dummy() # <- give dummy on empty node

So:

In [25]:
P = getPos(None, 'phrase_atom', A)
P.get(1)

Or:

In [26]:
P = getPos(1, 'phrase_atom', A)
P.get(1)

2
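One way such a null object could be written is sketched below. The class name `ToyDummy` is hypothetical and this is not the `Dummy` from `wordsets.langtools`; it only assumes the behaviour described in this section: any arguments are accepted, any method call or direct call returns `None`, and the object itself is falsy.

```python
class ToyDummy:
    """Falsy placeholder that absorbs attribute access and calls."""

    def __init__(self, *args, **kwargs):
        # accept and ignore whatever a Positions or Walker would receive
        pass

    def __bool__(self):
        return False

    def __getattr__(self, name):
        # any missing method (get, ahead, back, ...) becomes a do-nothing function
        return lambda *args, **kwargs: None

    def __call__(self, *args, **kwargs):
        # allows the object to stand in for a bound `get` method as well
        return None


TD = ToyDummy(None, 'phrase_atom', None)

print(TD.get(1))         # None
print(TD.get(1, 'lex'))  # None
print(TD.ahead(1))       # None
print(bool(TD))          # False
```

Because every path returns `None`, chained conditions such as `P(1, 'sp') == 'art'` simply evaluate to `False` instead of raising an error.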

# Need for Semantic Data

The accurate processing of word connections depends on fuller semantic data than BHSA provides. Future semantic data could be stored in a similar way to word sets (`wsets`). 

For example, in the two phrases

> (Exod 25:39) ככר זהב טהור <br>
> (2 Sam 24:24) בכסף שקלים חמשׁים

we see that זהב and כסף, despite standing in two different positions with two different words, indicate a kind of "composed of" semantic concept: "round gold" (i.e. a round composed of gold) and "silver shekels" (shekels composed of silver). To process these kinds of links, we need a list of nouns that often function as "material." But this is only the beginning. Many other words will have specific semantic values that motivate their syntactic behavior. Such a scope lies outside the bounds of this author's current project on Hebrew time phrases.

## A Compromise: Time Phrases

Since constructing these semantic classes is very time consuming, I want to start with a smaller set of cases. For now I will focus on parsing connections within time phrases, which I am analyzing in my ongoing PhD project.

In [27]:
timephrases = [ph for ph in F.otype.s('phrase_atom') 
                   if F.function.v(L.u(ph, 'phrase')[0]) == 'Time'
                   and len(L.d(ph, 'word')) > 2
                   and F.language.v(L.d(ph, 'word')[0]) == 'Hebrew'
            ]

print(f'{len(timephrases)} phrases ready')

2328 phrases ready


## Search & Display Functions

The functions below allow for fast searching and displaying of queries using the `Positions` class. The searches rely on the `NounConstructions` class, defined further below.

In [131]:
def prettyconds(cases):
    '''
    Iterate through an explain dict for a rela
    and print out all of checked conditions.
    '''
    
    for case in cases:
        print('cnds:')
        for cond, value in case['conds'].items():
            print('{:<30} {:>30}'.format(cond, str(value)))
        
        print()
        
def showmatch(cx):
    '''
    Displays a match from a Grammar test.
    '''
    
    firstword = list(cx.slots)[0]
    phrase = L.u(firstword, 'phrase_atom')[0]
    
    if not cx:
        print('NO MATCHES')
        print('-'*20)
        A.pretty(phrase, extraFeatures='sp st', withNodes=True)
        prettyconds(cx.cases)  # prettyconds expects the tuple of case dicts
        return None

    colors = itertools.cycle(['pink', 'lightblue', 
                              'yellow', 'lightgreen'])
    highlights = {}
    role2color = {}
    
    for role, slots in cx.role2slots.items():
        color = next(colors)
        role2color[role] = color
        for slot in slots:
            highlights[slot] = color
    
    A.pretty(phrase, withNodes=True, extraFeatures='sp st', highlights=highlights)
    # reveal color meanings
    for role,color in role2color.items():
        colmean = '<div style="background: {}; text-align: center">{}</div>'.format(color, role)
        display(HTML(colmean))
    
    pprint(cx.unfoldroles(), indent=4)
    print()
    
    prettyconds(cx.cases)
    display(HTML('<hr>'))
        
def test_search(relastr, show=10, end=None, name='', phrases=None):
    '''
    Searches phrases with the specified relation 
    and prints out their descriptive explanation.
    '''
    
    start = datetime.now()
    print('beginning search')
    
    # build a convenient test set of words
    phrases = phrases or [ph for ph in F.otype.s('phrase_atom') 
                   if F.typ.v(ph) in {'NP', 'PP'}
                   and len(L.d(ph, 'word')) > 1
              ]
    words = [w for ph in phrases for w in L.d(ph, 'word')]
    
    # random shuffle to get good diversity of examples
    random.shuffle(words)
    
    # set up grammar
    G = NounConstructions(wsets, A)
    
    matches = []
    append = matches.append
    
    # iterate and find matches on words
    for i,w in enumerate(words):

        # update every 5000 iterations
        if i%5000 == 0:
            print(f'\t{len(matches)} found ({i}/{len(words)})')
        
        # run the construction test named by relastr (e.g. 'geni')
        test = getattr(G, relastr)(w)
        
        # save matches, optionally filtered by construction name
        if test and (not name or test.name == name):
            append(test)
            
        # stop at end
        if len(matches) == end:
            break
        
        
    # display
    print('done at', datetime.now() - start)
    print(len(matches), 'matches found...')
    print('showing', min(show, len(matches)))
    
    for match in matches[:show]:
        showmatch(match)

## Constructions classes

While `Positions` provides concise access to context, a `Constructions` class contains a series of functions, each of which tests a set of conditions. The conditions are formed by testing word `Positions`. An example of a `Constructions` class is provided below.

In [132]:
class Bunch(object):
    """Stores variables for shorthand and safe access."""
    def __init__(self, vardict):
        """Initialize variables object with dict."""
        self.dict = vardict
        for k,v in vardict.items():
            setattr(self, k, v)
    def __getattr__(self, name):
        return None
    def __deepcopy__(self, memo=None):
        """Handle deep copy errors by instancing a diff Bunch object."""
        return Bunch(self.dict)
    
class Construction(object):
    """A linguistic construction and its attributes."""
    
    def __init__(self, **specs):
        """Initialize construction item.
        
        **specs:
            name: A name for the construction.
            roles: A dict which maps roles
                to either another Construction item
                or to a Text-Fabric word node.
            cases: A tuple containing condition dicts
                that were evaluated when processing this
                Construction. Key is string containing condition,
                value is Boolean.
            conds: A condition dict containing all of the
                conditions that evaluated to True to validate
                this Construction.
        """
        self.match = specs.get('match', {})
        self.name = specs.get('name', '')
        self.pattern = specs.get('pattern', specs.get('name', ''))
        self.roles = Bunch(specs.get('roles', {}))
        self.inherits = specs.get('inherits', set())
        self.conds = specs.get('conds', {})
        self.cases = specs.get('cases', tuple())
        self.role2slots = collections.defaultdict(set)
        self.slots = set()
        self.mapslots(self.roles.dict)
        self.slots = set(sorted(self.slots)) # sort slots
        self.slots2role = {
            tuple(sorted(slots)):role 
                for role, slots in self.role2slots.items()
        }
        
    def __bool__(self):
        if self.match:
            return True
        else:
            return False
        
    def __str__(self):
        if self:
            return f'CX {self.name} {self.slots}'
        else:
            return ''
        
    def __repr__(self):
        if self:
            return f'CX {self.name} {self.slots}'
        else:
            return ''
            
    def mapslots(self, rolesdict, rolename=None):
        """Recursively map all slots to top embedding role name.

        Match items contain a roles key which can contain
        any number of other match items. This function maps
        all constituent words (Text-Fabric "slots") to their
        top-level linguistic unit (linguistic role).
        """
        for role, item in rolesdict.items():
            if type(item) == Construction:
                self.mapslots(
                    item.roles.dict,
                    rolename=rolename or role
                )
            elif type(item) == int:
                self.role2slots[rolename or role].add(item)
                self.slots.add(item)

    def updaterole(self, role, item):
        """Updates the role."""
        setattr(self.roles, role, item)
        self.roles.dict[role] = item
        
        # update slots 
        
        # delete previous slots2role entry for this role, if any
        oldslots = tuple(sorted(self.role2slots[role]))
        self.slots2role.pop(oldslots, None)
        
        if type(item) == Construction:
            self.slots |= item.slots    
            self.role2slots[role] = set(item.slots)
            self.slots2role[tuple(sorted(item.slots))] = role
        elif type(item) == int:
            self.slots.add(item)
            self.role2slots[role] = {item}  # keep a set so getslotrole works
            self.slots2role[(item,)] = role
            
    def getslotrole(self, slot):
        """Return the role to which a slot belongs."""
        for role, slots in self.role2slots.items():
            if slot in slots:
                return role
    
    def unfoldroles(self, cx=None):
        """Return all contained construction roles as a dict.

        Recursively calls down into construction objects to convert
        to role.dict with TF slots.
        """
        cx = cx or self
        roledict={}
        roledict['__cx__'] = cx.name
        for role, item in cx.roles.dict.items():
            if type(item) == Construction :
                roledict[role] = self.unfoldroles(item)
            elif type(item) == int:
                roledict[role] = item
        return roledict

    
class CXbuilder(object):
    """Identifies and builds constructions using Text-Fabric nodes."""
    
    def __init__(self, semsets, tf, **kwargs):
        """Initialize Constructions object.
        
        Arguments:
            semsets: A dictionary containing semantic
                sets. Key should be the name of the set.
                Value is a set of TF nodes.
            tf: An instance of Text-Fabric.
            
        **kwargs:
            context: the context that contains the node in 
            which to run the attribute tests.
        """
        self.tf = tf
        self.F, self.T, self.L = tf.api.F, tf.api.T, tf.api.L
        self.context = kwargs.get('context', 'phrase_atom')
        self.semsets = Bunch(semsets)
        self.cxs = tuple()
    
    def getP(self, node):
        """Get Positions object for a TF node.
        
        Return Dummy object if not node.
        """
        if not node:
            return Dummy()
        return Positions(node, self.context, self.tf).get
    
    def getWk(self, node):
        """Get Walker object for a TF word node.
        
        Return Dummy object if not node.
        """
        if not node:
            return Dummy()
        return Walker(node, self.context, self.tf)
    
    def test(self, *cases):
        """Populate a Construction from the cases whose conditions are all True.
        
        The last-matching case will be used to populate
        a Construction object. This allows more complex
        cases to take precedence over simpler ones.
        
        Args:
            cases: an arbitrary number of dictionaries,
                each of which contains a string key that
                describes the test and a test that evals 
                to a Boolean.
        
        Returns:
            a populated or blank Construction object
        """
        
        # find cases where all conds are True and every role is filled
        # (checking role keys is always True, so check the role values)
        test = [
            case for case in cases
                if all(case['conds'].values())
                    and all(case['roles'].values())
        ]
        
        # return last test
        if test:
            return Construction(
                match=test[-1],
                cases=cases,
                **test[-1]
            )
        else:
            return Construction(cases=cases, **cases[0])
        
    def findall(self, n):
        """Runs analysis for all constructions with a node.
        
        Returns a list of the matching Construction objects.
        """
        results = []
        for funct in self.cxs:
            cx = funct(n)
            if cx:
                results.append(cx)
        return results

In [133]:
class NounConstructions(CXbuilder):
    """Class for defining noun constructions."""
    
    def __init__(self, wsets, tf):
        
        """Initialize with Constructions attribs/methods."""
        CXbuilder.__init__(self, wsets, tf)
        
        # map cx searches for full analyses
        self.cxs = (
            self.defi,
            self.card_chain,
            self.adjv,
            self.advb,
            self.attrib,
            self.geni,
            self.numb,
            self.prep,
        )

    def defi(self, w):
        """Matches a definite construction."""
        
        P = self.getP(w)
        
        return self.test( 
            {
                'name': 'defi',
                'roles': {'defi': w, 'head': P(1)},
                'conds': {

                    'F.sp.v(w) == art':
                        self.F.sp.v(w) == 'art',

                    'bool(P(1))':
                        bool(P(1))
                }
            }
        )
    
    def prep(self, w):
        """Matches a preposition with a modified element."""
                
        P = self.getP(w)
        
        return self.test(
            {
                'name': 'prep',
                'roles': {'prep':w, 'head': P(1)},
                'conds': {

                    'w in preps':
                        w in self.semsets.preps,

                    'F.prs.v(w) == absent':
                        self.F.prs.v(w) == 'absent',
                    
                    'bool(P(1))':
                        bool(P(1)),
                }
            }
        )
        
    def geni(self, w):
        """Queries for "genitive" relations on a word."""
        
        P = self.getP(w)
        sm = self.semsets
        
        return self.test(
            {
                'name': 'geni',
                'roles': {'geni': P(0), 'head': P(-1)},
                'conds': {

                    'P(-1, st) == c': 
                        P(-1,'st') == 'c',

                    'P(-1) not in quants|preps': 
                        P(-1) not in sm.quants|sm.preps,
                }
            }
        )

    def advb(self, w):
        """Match an adverb and the element it modifies."""
        
        P = self.getP(w)
        
        return self.test(
           {
                'name': 'advb',
                'roles': {'advb': w, 'head': P(1)},
                'conds': {
                    'F.sp.v(w) == advb':
                        self.F.sp.v(w) == 'advb',
                    'P(1) in noms':
                        P(1) in self.semsets.noms,
                }
            }
        )
    
    def adjv(self, w):
        """Matches a word serving as an adjective."""
        
        P = self.getP(w)
        F = self.F
        sm = self.semsets
        
        # check for recursive adjective matches 
        a2match = self.adjv(P(-1)) if P(-1) else Construction()
        a2match_head = a2match.roles.head
        
        common = {
            'P(-1) in noms':
                P(-1) in sm.noms,
            
            'P(-1, st) & {NA, a}': 
                P(-1,'st') in {'NA', 'a'},   
            
            'P(-1) not in {quants|preps}':
                P(-1) not in sm.quants|sm.preps,
        }
                
        tests = (
            
            {
                'name': 'adjv',
                'pattern': 'adjv (1x)',
                'roles': {'adjv':w, 'head': P(-1)},
                'conds': dict(common, **{
                    'F.sp.v(w) in {adjv, verb}':
                        F.sp.v(w) in {'adjv', 'verb'},
                })
            },
            {
                'name': 'adjv',
                'pattern': 'adjv (2x)',
                'roles': {'adjv': P(0), 'head': a2match_head},
                'conds': dict(common, **{
                    
                    'P(0,sp) in {adjv, verb}':
                        P(0,'sp') in {'adjv', 'verb'},
                    
                     'self.adjv(P(-1)) and target != P(0)':
                        bool(a2match) and a2match_head != P(0)
                })
            }
        )

        return self.test(*tests)
     
    def attrib(self, w):
        """Identify elements in an attrib construction.
        
        In Hebrew this construction typically consists of four slots:
            > ה + A + ה + B
        Attrib identifies each of these elements and labels them.
        A is assumed to be the head, or modified, element and B
        is assumed to be an adjectival element.
        """
        
        sm = self.semsets
        
        # CX consists of two constituent cxs
        # start walk from head of first match
        defi1 = self.defi(w)
        d1head = defi1.roles.head
        Wk = self.getWk(d1head)

        # walk to next valid defi match
        # and allow adjectives to intervene:
        defi2 = Wk.ahead(
            lambda n: self.defi(n),
            go=lambda n: self.F.sp.v(n)=='adjv',
            output=True
        ) if Wk else Construction()
        defi2 = defi2 or Construction()
                
        return self.test(
            {
                'name': 'attrib',
                'roles': {'head': defi1, 'attrib': defi2},
                'conds': {
                    'bool(defi1)':
                        bool(defi1),
                    'bool(defi2)':
                        bool(defi2), 
                    'defi2[roles][head] not in quants':
                        defi2.roles.head not in sm.quants
                }
            }
        )
        
    def numb(self, w):
        """Defines numerical relations with a non-quant word.
        
        Often but not always indicates quantification as other
        semantic relations are possible.
        """

        P = self.getP(w)
        Wk = self.getWk(w)
        sm = self.semsets
        is_nom = (lambda n: n in sm.noms and n not in sm.quants)
        # use self.F rather than relying on a global F hoisted into the notebook
        behind_nom = Wk.back(is_nom, go=lambda n: self.F.sp.v(n)=='art')

        return self.test(
        
            {
                'name': 'numb',
                'pattern': 'numbered forward',
                'roles': {'numb': w, 'head': P(1)},
                'conds': {
                    'w in quants':
                        w in sm.quants,
                    
                    'P(1,sp) != conj':
                       P(1,'sp') != 'conj',
                    
                    'P(1) not in quants':
                        P(1) not in sm.quants,
                    
                    'bool(P(1))':
                        bool(P(1))
                },
            },  
            {
                'name': 'numb',
                'pattern': 'numbered backward',
                'roles': {'numb': w, 'head': behind_nom},
                'conds': {
                    
                    'w in quants':
                        w in sm.quants,
                    
                    'not Wk.ahead(is_nominal)':
                        not Wk.ahead(is_nom),
                    
                    'bool(Wk.back(is_nominal))':
                        bool(behind_nom)
                }
            }
        )
        
    def card_chain(self, w):
        """Defines cardinal number chain constructions"""
        
        P = self.getP(w)
        F = self.F
        
        return self.test(
            {
                'name': 'card_chain',
                'pattern': 'adjacent',
                'roles': {'card':w, 'head':P(-1)},
                'conds': {
                    
                    'F.ls.v(w) == card':
                        F.ls.v(w) == 'card',
                    'P(-1,ls) == card':
                        P(-1,'ls') == 'card',                    
                }
            },
            {
                'name': 'card_chain',
                'pattern': 'conjunctive',
                'roles': {'card': w, 'head': P(-2), 'conj': P(-1)},
                'conds': {
                    'F.ls.v(w) == card':
                        F.ls.v(w) == 'card',
                    'P(-1,lex) == W':
                        P(-1,'lex') == 'W',
                    'P(-2,ls) == card':
                        P(-2,'ls') == 'card',   
                }
            }
        )

In [134]:
G = NounConstructions(wsets, A)
test = G.attrib(688)
test2 = G.card_chain(189685)

In [135]:
test2.cases

({'name': 'card_chain',
  'pattern': 'adjacent',
  'roles': {'card': 189685, 'head': 189684},
  'conds': {'F.ls.v(w) == card': True, 'P(-1,ls) == card': False}},
 {'name': 'card_chain',
  'pattern': 'conjunctive',
  'roles': {'card': 189685, 'head': 189683, 'conj': 189684},
  'conds': {'F.ls.v(w) == card': True,
   'P(-1,lex) == W': True,
   'P(-2,ls) == card': True}})

In [136]:
showmatch(test2)

{'__cx__': 'card_chain', 'card': 189685, 'conj': 189684, 'head': 189683}

cnds:
F.ls.v(w) == card                                        True
P(-1,ls) == card                                        False

cnds:
F.ls.v(w) == card                                        True
P(-1,lex) == W                                           True
P(-2,ls) == card                                         True



In [137]:
test.slots2role

{(688, 689): 'head', (690, 691): 'attrib'}

In [138]:
test.roles.dict

{'head': CX defi {688, 689}, 'attrib': CX defi {690, 691}}

In [139]:
test.getslotrole(691)

'attrib'

In [140]:
pprint(test.unfoldroles())

{'__cx__': 'attrib',
 'attrib': {'__cx__': 'defi', 'defi': 690, 'head': 691},
 'head': {'__cx__': 'defi', 'defi': 688, 'head': 689}}


In [141]:
#test_search('cons', name='', show=100, end=50, phrases=timephrases)

## Chunk Constructions

So far we have delimited a set of standard noun constructions. But many of these constructions overlap or are incomplete. For instance, consider a prepositional construction built on the following phrase (space-separated for clarity):

> ב ה יום

The prepositional construction, as it stands, only recognizes the first word after the preposition, yielding the following construction:

> ב ה

The reason is that ה must **first** be united with its remaining elements. We need an **order of operations** to complete this procedure, that is, a list of global priorities that tells the chunker algorithm which parts to merge first. The beginnings or ends of overlapping constructions can subsequently be used to link the parts together. For instance, we have not only a preposition construction but also a definite construction, which recognizes:

> ה יום

If the definite construction has priority over the preposition construction, the algorithm can gather all of the requisite pieces before placing them in relation to the preposition.
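
The order of operations can be pictured as a simple priority sort. The snippet below is only a minimal sketch, not the chunker itself: the `PRIORITY` list and the `(name, slots)` tuples are hypothetical stand-ins, and the slot numbers 1, 2, 3 are invented placeholders for ב, ה, יום.

```python
# Hypothetical priority list: names earlier in the list are merged first.
PRIORITY = ['defi', 'prep']

# Toy constructions for the phrase ב ה יום, as (name, slots) pairs.
# Slot numbers are illustrative only: 1 = ב, 2 = ה, 3 = יום.
found = [
    ('prep', (1, 2)),  # preposition + the next word only
    ('defi', (2, 3)),  # article + noun
]

# Sort on priority so that the definite construction is handled first.
ordered = sorted(found, key=lambda cx: PRIORITY.index(cx[0]))

for name, slots in ordered:
    print(name, slots)
```

With this ordering the definite construction is processed before the preposition construction, so ה יום is already a unit by the time it has to be related to ב.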

### Set Logic

We can use set logic to determine when constructions overlap in the slots they contain. Constructions that are wholly contained in another construction might also need to be discarded. That is the case, for instance, with the attributive adjective construction:

> ה יום ה שׁבעי

This construction already requires the definite construction as a condition for its existence and contains the definite constructions' slots in its own `slots` attribute. In this case, the algorithm should not create additional, standalone instances of the definite constructions.
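
As a rough illustration of the two checks, the sketch below uses plain Python sets. The slot numbers 688-691 come from the `attrib` example shown earlier; the `prep` set and the helper names `overlaps` and `contained` are hypothetical and exist only for this example.

```python
# Slot sets modelled on the attrib example at slots 688-691.
attrib = {688, 689, 690, 691}
defi_head = {688, 689}
defi_attrib = {690, 691}

# Hypothetical construction that only partially overlaps `attrib`.
prep = {687, 688}

def overlaps(a, b):
    """True if two slot sets share at least one slot."""
    return bool(a & b)

def contained(inner, outer):
    """True if every slot of `inner` is also in `outer`."""
    return inner <= outer

print(overlaps(prep, attrib))        # True: they share slot 688
print(contained(prep, attrib))       # False: slot 687 lies outside attrib
print(contained(defi_head, attrib))  # True: a candidate for discarding
```

Overlap without containment signals two constructions that still need to be linked, while full containment signals a construction that the larger one has already absorbed.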

### Build Constructions

Before we can make the chunks, we need to run the `NounConstructions` builder on the phrases in `timephrases`. We do that below and store the resulting `Construction` objects in the `phrase2cx` mapping. This takes approximately 2 minutes on a 2017 MacBook Pro.

In [142]:
phrase2cx = collections.defaultdict(list)
G = NounConstructions(wsets, A)

# time it
start = datetime.now()

print(f'{datetime.now()-start} beginning analysis...')

for i, phrase in enumerate(timephrases):
        
    # find all constructions for every word in the phrase
    for w in L.d(phrase, 'word'):
        constructions = G.findall(w)
        if constructions:
            phrase2cx[phrase].extend(constructions)
        
    # report status
    if i % 500 == 0 and i:
        print(f'\t{datetime.now()-start}\tdone with iter {i}/{len(timephrases)}')
        
print(f'{datetime.now()-start}\tCOMPLETE')

0:00:00.000040 beginning analysis...
	0:00:21.207028	done with iter 500/2328
	0:00:41.176612	done with iter 1000/2328
	0:01:00.280601	done with iter 1500/2328
	0:01:17.249739	done with iter 2000/2328
0:01:29.993065	COMPLETE


In [143]:
def sort_cxs(cxlist):
    """Sort constructions based on order of priority."""
    order = ['attrib', 'defi', 'adjv', 'card_chain', 'numb', 'geni', 'prep']
    # sort on priority, then on slots; using a key function avoids comparing
    # the construction objects themselves when two entries tie
    return sorted(
        cxlist,
        key=lambda cx: (order.index(cx.name), sorted(cx.slots))
    )

def buildclusters(cxlist):
    """Find all overlapping constructions"""
    
    clusters = []
    cxlist = list(cxlist) # copy so the caller's list is not modified
    
    # nothing to cluster
    if not cxlist:
        return clusters
    
    # seed the first cluster with the first construction
    thiscluster = [cxlist.pop(0)]
    theseslots = set(thiscluster[0].slots)
    
    # loop continues as it snowballs and picks up slots
    # a cluster is closed when a complete pass produces no new matches
    while cxlist:
        
        matched = False # whether this pass added anything
        
        for cx in cxlist:
            if theseslots & cx.slots:
                thiscluster.append(cx)
                theseslots |= cx.slots
                matched = True
        
        cxlist = [cx for cx in cxlist if cx not in thiscluster]
        
        # close the cluster and seed the next one
        if not matched:
            clusters.append(thiscluster)
            thiscluster = [cxlist.pop(0)]
            theseslots = set(thiscluster[0].slots)
            
    clusters.append(thiscluster)
            
    return clusters

def arrange_cxs(cxlist):
    """Clusters and orders constructions."""
    return [sort_cxs(clist) for clist in buildclusters(cxlist)]

def weaveCX(cxlist, cx=None):
    """Weave together constructions on their intersections."""
    
    # the search is complete, stop here
    if not cxlist:
        return cx
    elif cx is None and len(cxlist) == 1:
        return cxlist[0]
    
    # Copy constructions and operate on copies
    cx1 = cx or copy.deepcopy(cxlist.pop(0))
    cx2 = copy.deepcopy(cxlist.pop(0))
    
    # get one slot from the intersection between cx1 and 2
    # (set order is arbitrary, so this is not necessarily the lowest slot)
    link = next(iter(cx1.slots & cx2.slots), None)
    
    # connect broken links
    if link is None:
        sys.stderr.write(f'No explicit link between {cx1} and {cx2}\n')
        cx2 = weaveCX(cxlist, cx2) # compile the rest of the cx
        sys.stderr.write(f'\tattempt connect {cx1} and {cx2}\n')
        return weaveCX([cx2], cx1) # attempt to connect cx1 and 2 again
    
    # get the linking slot's role in both cxs
    link_cx1role = cx1.getslotrole(link)
    link_cx2role = cx2.getslotrole(link)
        
    # discard cx2 if fully covered by cx1
    if cx1.slots.issuperset(cx2.slots):
        return weaveCX(cxlist, cx1)
    
    # replace role in cx1 with cx2
    elif (
        link_cx1role != 'head'
        and link_cx2role == 'head'
        and not (cx1.slots - cx2.slots) # expand not shrink
    ):
        cx1.updaterole(link_cx1role, cx2)
        return weaveCX(cxlist, cx1)
        
    # replace the role in cx2 with cx1
    # (`and` binds tighter than `or`; parentheses make that explicit)
    elif (
        (link_cx1role != 'head' and link_cx2role != 'head')
        or (cx1.slots - cx2.slots)
    ):
        cx2.updaterole(link_cx2role, cx1)
        return weaveCX(cxlist, cx2)
    
    # no weaving rule applies; the function returns None in this case
    return None
    
def getcxs(phrase_atom):
    """Yield one woven construction per cluster found in the given phrase."""
    cxlists = arrange_cxs(phrase2cx[phrase_atom])
    for cxlist in cxlists:
        yield weaveCX(cxlist)

In [144]:
dog = list(getcxs(974016))[0]

No explicit link between CX attrib {115232, 115233, 115230, 115231} and CX prep {115228, 115229}
	attempt connect CX attrib {115232, 115233, 115230, 115231} and CX prep {115228, 115229, 115230}

In [145]:
showmatch(dog)

{   '__cx__': 'prep',
    'head': {   '__cx__': 'attrib',
                'attrib': {'__cx__': 'defi', 'defi': 115232, 'head': 115233},
                'head': {'__cx__': 'defi', 'defi': 115230, 'head': 115231}},
    'prep': {'__cx__': 'prep', 'head': 115229, 'prep': 115228}}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True
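
The nested roles printed above suggest one way to read off a single head word: keep following the `head` role until a plain word slot is reached. The sketch below assumes that the `head` role marks the semantic head at every level of nesting, which has not been verified for every construction type; `semantic_head` and `dog_roles` are hypothetical names, and the dictionary is copied by hand from the output above.

```python
def semantic_head(cx):
    """Follow nested 'head' roles down to a single word slot."""
    while isinstance(cx, dict):
        cx = cx['head']
    return cx

# The unfolded construction printed above for node 974016.
dog_roles = {
    '__cx__': 'prep',
    'head': {
        '__cx__': 'attrib',
        'attrib': {'__cx__': 'defi', 'defi': 115232, 'head': 115233},
        'head': {'__cx__': 'defi', 'defi': 115230, 'head': 115231},
    },
    'prep': {'__cx__': 'prep', 'head': 115229, 'prep': 115228},
}

print(semantic_head(dog_roles))  # 115231
```

The walk passes through the `prep`, `attrib`, and `defi` levels and stops at slot 115231, the noun inside the first definite construction.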



<br>
<br>
<br>
<hr>

## Test Cases

In [148]:
A.pretty(1020587)

In [149]:
A.pretty(906959)

In [150]:
A.pretty(905154)

In [151]:
A.pretty(974016)

## Testing Construction Builds

In [159]:
shuff = [ph for ph in timephrases 
            if len(L.d(ph, 'word')) > 2
        ]

random.shuffle(shuff)

In [160]:
for phrase in shuff[:25]:
    print(phrase)
    print('-'*10)
    for cx in getcxs(phrase):
        showmatch(cx)
    print()

1133203
----------


{   '__cx__': 'prep',
    'head': {'__cx__': 'defi', 'defi': 358030, 'head': 358031},
    'prep': 358029}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




950423
----------


{   '__cx__': 'prep',
    'head': {   '__cx__': 'attrib',
                'attrib': {'__cx__': 'defi', 'defi': 75792, 'head': 75793},
                'head': {'__cx__': 'defi', 'defi': 75790, 'head': 75791}},
    'prep': 75789}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1092048
----------


{   '__cx__': 'prep',
    'head': {   '__cx__': 'geni',
                'geni': {'__cx__': 'defi', 'defi': 300294, 'head': 300295},
                'head': 300293},
    'prep': 300292}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1036250
----------


{   '__cx__': 'prep',
    'head': {'__cx__': 'geni', 'geni': 214246, 'head': 214245},
    'prep': 214244}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




997710
----------


{   '__cx__': 'prep',
    'head': {'__cx__': 'defi', 'defi': 153680, 'head': 153681},
    'prep': 153679}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1059611
----------


{   '__cx__': 'prep',
    'head': {   '__cx__': 'attrib',
                'attrib': {'__cx__': 'defi', 'defi': 249056, 'head': 249057},
                'head': {'__cx__': 'defi', 'defi': 249054, 'head': 249055}},
    'prep': 249053}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1066812
----------


{   '__cx__': 'prep',
    'head': {   '__cx__': 'attrib',
                'attrib': {'__cx__': 'defi', 'defi': 261009, 'head': 261010},
                'head': {'__cx__': 'defi', 'defi': 261007, 'head': 261008}},
    'prep': 261006}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




929876
----------


{   '__cx__': 'prep',
    'head': {'__cx__': 'numb', 'head': 39026, 'numb': 39025},
    'prep': 39024}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1140061
----------


{   '__cx__': 'prep',
    'head': {   '__cx__': 'attrib',
                'attrib': {'__cx__': 'defi', 'defi': 368883, 'head': 368884},
                'head': {'__cx__': 'defi', 'defi': 368881, 'head': 368882}},
    'prep': 368880}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1084522
----------


{   '__cx__': 'prep',
    'head': {'__cx__': 'defi', 'defi': 289003, 'head': 289004},
    'prep': 289002}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




931538
----------


{   '__cx__': 'prep',
    'head': {'__cx__': 'numb', 'head': 41522, 'numb': 41523},
    'prep': 41521}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1103846
----------


{   '__cx__': 'prep',
    'head': {'__cx__': 'geni', 'geni': 317635, 'head': 317634},
    'prep': 317633}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1002462
----------


{   '__cx__': 'prep',
    'head': {'__cx__': 'defi', 'defi': 160705, 'head': 160706},
    'prep': 160704}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1169910
----------


{   '__cx__': 'prep',
    'head': {'__cx__': 'card_chain', 'card': 422187, 'head': 422186},
    'prep': 422185}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1059633
----------


{   '__cx__': 'prep',
    'head': 249112,
    'prep': {'__cx__': 'prep', 'head': 249111, 'prep': 249110}}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




906211
----------


{   '__cx__': 'numb',
    'head': 2374,
    'numb': {'__cx__': 'card_chain', 'card': 2373, 'conj': 2372, 'head': 2371}}

cnds:
w in quants                                              True
P(1,sp) != conj                                          True
P(1) not in quants                                       True
bool(P(1))                                               True

cnds:
w in quants                                              True
not Wk.ahead(is_nominal)                                False
bool(Wk.back(is_nominal))                               False



{'__cx__': 'numb', 'head': 2377, 'numb': 2376}

cnds:
w in quants                                              True
P(1,sp) != conj                                          True
P(1) not in quants                                       True
bool(P(1))                                               True

cnds:
w in quants                                              True
not Wk.ahead(is_nominal)                                False
bool(Wk.back(is_nominal))                               False




928376
----------


{   '__cx__': 'numb',
    'head': {'__cx__': 'defi', 'defi': 36667, 'head': 36668},
    'numb': 36666}

cnds:
w in quants                                              True
P(1,sp) != conj                                          True
P(1) not in quants                                       True
bool(P(1))                                               True

cnds:
w in quants                                              True
not Wk.ahead(is_nominal)                                False
bool(Wk.back(is_nominal))                               False




914183
----------


{   '__cx__': 'prep',
    'head': {'__cx__': 'defi', 'defi': 14885, 'head': 14886},
    'prep': 14884}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




949306
----------


{   '__cx__': 'geni',
    'geni': 73522,
    'head': {'__cx__': 'numb', 'head': 73521, 'numb': 73520}}

cnds:
P(-1, st) == c                                           True
P(-1) not in quants|preps                                True




1097653
----------


{   '__cx__': 'prep',
    'head': {   '__cx__': 'attrib',
                'attrib': {'__cx__': 'defi', 'defi': 308985, 'head': 308986},
                'head': {'__cx__': 'defi', 'defi': 308983, 'head': 308984}},
    'prep': 308982}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




909510
----------


{   '__cx__': 'prep',
    'head': {   '__cx__': 'adjv',
                'adjv': 8056,
                'head': {'__cx__': 'defi', 'defi': 8054, 'head': 8055}},
    'prep': 8053}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




984883
----------


{   '__cx__': 'prep',
    'head': {   '__cx__': 'attrib',
                'attrib': {'__cx__': 'defi', 'defi': 134168, 'head': 134169},
                'head': {'__cx__': 'defi', 'defi': 134166, 'head': 134167}},
    'prep': 134165}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




1129324
----------


{   '__cx__': 'numb',
    'head': {'__cx__': 'defi', 'defi': 352652, 'head': 352653},
    'numb': 352651}

cnds:
w in quants                                              True
P(1,sp) != conj                                          True
P(1) not in quants                                       True
bool(P(1))                                               True

cnds:
w in quants                                              True
not Wk.ahead(is_nominal)                                False
bool(Wk.back(is_nominal))                               False




1051694
----------


{'__cx__': 'prep', 'head': 236534, 'prep': 236533}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True



{   '__cx__': 'prep',
    'head': {   '__cx__': 'attrib',
                'attrib': {'__cx__': 'defi', 'defi': 236539, 'head': 236540},
                'head': {'__cx__': 'defi', 'defi': 236537, 'head': 236538}},
    'prep': 236536}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




964164
----------


{   '__cx__': 'prep',
    'head': {   '__cx__': 'attrib',
                'attrib': {'__cx__': 'defi', 'defi': 99118, 'head': 99119},
                'head': {'__cx__': 'defi', 'defi': 99116, 'head': 99117}},
    'prep': 99115}

cnds:
w in preps                                               True
F.prs.v(w) == absent                                     True
bool(P(1))                                               True




