# Composing Time Constructions

In this notebook we build and test a Hebrew phrase parser.

In [2]:
import sys
import collections
import pickle
import random
import re
import copy
import numpy as np
import networkx as nx
from datetime import datetime
import matplotlib.pyplot as plt
from Levenshtein import distance as lev_dist
from pprint import pprint

# local packages
from tf_tools.load import load_tf

# load semantic vectors
from paths import semvector
with open(semvector, 'rb') as infile: 
    semdist = pickle.load(infile)

# load and configure Text-Fabric
TF, api, A = load_tf()
F, E, T, L = api.F, api.E, api.T, api.L
A.displaySetup(condenseType='phrase', withNodes=True, extraFeatures='st')

This is Text-Fabric 7.9.0
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

122 features found and 5 ignored
  0.00s loading features ...
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
  5.68s All features loaded/computed - for details use loadLog()


In [38]:
# add grammar to path
sys.path.append('../cxs')

# Machinery

We need some machinery to do the hard work of looking in and around a node. The older approach used TF search templates, but these are inefficient at scale and are always bound by the limits of the query language. I take another approach here: a set of classes that specify locations and directions within a given context.

In [3]:
from positions import Positions, PositionsTF, Walker, Dummy

## `Positions(TF)`

The `Positions` class enables concise access to adjacent nodes within a given context. This allows us to write algorithms with query-like efficiency with all of the power of Python. 

This class is instantiated on a word node and can provide contextual look-up data for a given word. For example, given a phrase containing the following word nodes:

> (189681, 189682, **189683**, 189684, 189685, 189686) <br>

representing the following phrase (space separated for clarity):

> ב שׁנת **שׁלשׁים** ו שׁמנה שׁנה

Given that the bolded node, `189683`, is our `source` word, we instantiate the class with the node, the string `'phrase_atom'` (the context we want to search within), and an instance of Text-Fabric (`tf`):

In [4]:
      #    source node    context  TF instance  
      #         |            |       |
P = PositionsTF(189683, 'phrase_atom', A).get

If we want to obtain the word adjacent one space forward, we simply ask `P` for `1`, which gives us the next word in the phrase.

In [5]:
P(1)

189684

If we try to ask for 4 words forward, we go beyond the bounds of the phrase. But `P` handles this by returning nothing:

In [6]:
P(4)

To look back one word, we simply give a negative value:

In [7]:
P(-1)

189682

Finally, `P` can be used to quickly call features on these words. For instance, to get the lexeme of the word two positions ahead of `189683`:

In [8]:
P(2,'lex')

'CMNH/'

And if we want to get a number of features, we can just add other features to the arguments. The result is a feature set:

In [9]:
P(2, 'lex', 'nu')

{'CMNH/', 'sg'}

`P` can also return features of the source node itself when given a position of `0`:

In [10]:
P(0, 'lex')

'CLC/'

### `Positions` also exists in a non-TF version

The non-TF version of `Positions` performs the same functions on any ordered iterable.

In [11]:
test_ps = ['The', 'good', 'dog', 'jumped.']

P = Positions('good', test_ps).get

In [12]:
P(1)

'dog'

`Positions` can apply a function to the result via the optional `do` argument. In the example below, the word two positions ahead is found and an upper-case function is called on the string.

In [13]:
P(2, do=lambda w: w.upper())

'JUMPED.'

The non-tf version of `Positions` makes it possible to do positionality searches with any ordered list of Python objects that represent linguistic units.
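
To make the idea concrete, here is a minimal sketch of how such a positional look-up could work. It is not the code in `positions.py`: the class name `MiniPositions` and its internals are my own assumptions, and it only covers offsets and the `do` option.

```python
class MiniPositions:
    """Hypothetical, simplified stand-in for a positional look-up."""

    def __init__(self, source, context):
        self.context = list(context)
        # remember where the source sits inside its context
        self.index = self.context.index(source)

    def get(self, offset, do=None):
        """Return the item `offset` steps from the source, or None if out of bounds."""
        target = self.index + offset
        if not 0 <= target < len(self.context):
            return None
        item = self.context[target]
        return do(item) if do else item


words = ['The', 'good', 'dog', 'jumped.']
P_demo = MiniPositions('good', words).get

print(P_demo(1))                          # dog
print(P_demo(2, do=lambda w: w.upper()))  # JUMPED.
print(P_demo(3))                          # None (beyond the end of the list)
print(P_demo(-2))                         # None (before the start of the list)
```

The bounds check is what allows out-of-range requests to return nothing instead of raising an `IndexError` or wrapping around to the other end of the list.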

## `Walker`

`Walker` performs a similar function to `Positions`, except that it is agnostic about exact positions: it walks either `ahead` or `back` from the source to a target node in the context. A function must be supplied that returns `True` on the target node.

We instantiate the `Walker` using the same source and context as above.

In [14]:
source = 189683
# get words inside source's phrase_atom
positions = L.d(
    L.u(189683,'phrase_atom')[0], 'word'
)

Wk = Walker(source, positions)

`Walker` is demonstrated below with the same word. A simple `lambda` function tests the `ls` (lexical set) feature. In the example below, we find the first word ahead of `189683` that is a cardinal number:

In [15]:
Wk.ahead(lambda w: F.ls.v(w) == 'card')

189685

When there is no valid match, `None` is returned:

In [16]:
Wk.ahead(lambda w: F.ls.v(w) == 'BOOGABOOGA')

Another example wherein we walk backwards to the preposition:

In [17]:
Wk.back(lambda w: F.sp.v(w) == 'prep')

189681

We can also specify that the walk should be interrupted under certain conditions with a `stop` function. In this case we walk forward to the next cardinal number, but the walk is interrupted when the `stop` function detects a conjunction.

In [18]:
Wk.ahead(lambda w: F.ls.v(w) == 'card',
         stop=lambda w: F.sp.v(w) == 'conj')

We can also specify the opposite with a `go` function argument, which defines the nodes that are allowed to intervene between `source` and `target`. Below we specify that *only* a conjunction should intervene.

In [19]:
Wk.ahead(lambda w: F.ls.v(w) == 'card',
         go=lambda w: F.sp.v(w) == 'conj')

189685

The `go` and `stop` functions can be as permissive or strict as desired.

Finally, we can tell `Walker` that the output of the validation function should be returned instead of the node itself with the optional argument `output=True`:

In [20]:
val_funct = lambda w: F.ls.v(w) if F.ls.v(w)=='card' else None

Wk.ahead(val_funct, output=True)

'card'

This ability is useful for certain tests.

Like `Positions`, `Walker` can be used in non-TF contexts:

In [21]:
test_ps = ['The', 'bad', 'cat', 'swatted.']

Wk_notf = Walker('bad', test_ps)

In [22]:
Wk_notf.ahead(lambda w: w.startswith('sw'))

'swatted.'

### Returning All Results along Path

`Walker` can also return all results along the path by toggling `every=True`:

In [23]:
Wk_notf.ahead(lambda w: type(w)==str, every=True)

['cat', 'swatted.']
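
The walking logic can be sketched in plain Python. This is an illustration under my own assumptions, not the `Walker` implementation: the function name `walk_ahead` is made up, I assume the target test is applied before `stop` and `go`, and I assume `every=True` yields an empty list when nothing matches. English glosses stand in for the words of the phrase used above.

```python
def walk_ahead(source, context, test, stop=None, go=None, every=False, output=False):
    """Hypothetical sketch: walk forward from `source` until `test` succeeds."""
    context = list(context)
    found = []
    for item in context[context.index(source) + 1:]:
        result = test(item)
        if result:
            hit = result if output else item
            if not every:
                return hit
            found.append(hit)
            continue
        # a non-matching item may interrupt the walk
        if stop and stop(item):
            break
        if go and not go(item):
            break
    return found if every else None


# toy part-of-speech tags for the glossed phrase
pos = {'in': 'prep', 'year': 'subs', 'thirty': 'card', 'and': 'conj', 'eight': 'card'}
phrase = ['in', 'year', 'thirty', 'and', 'eight', 'year']

def is_card(w):
    return pos[w] == 'card'

def is_conj(w):
    return pos[w] == 'conj'

print(walk_ahead('thirty', phrase, is_card))                     # eight
print(walk_ahead('thirty', phrase, is_card, stop=is_conj))       # None
print(walk_ahead('thirty', phrase, is_card, go=is_conj))         # eight
print(walk_ahead('thirty', phrase, lambda w: True, every=True))  # ['and', 'eight', 'year']
```

The `stop` and `go` arguments are two sides of the same check: `stop` names what may not intervene, `go` names the only things that may.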

## `Dummy`

When writing conditions and logic, we want an object that passively receives `NoneType`s or zero `int`s without throwing errors. Such an object should also return `None` to reflect its `False` value. `Dummy` provides this functionality. It can receive all of the same arguments, kwargs, and function calls as a `Positions` or `Walker` object, but it returns absolutely nothing. Ouch.

In [24]:
D = Dummy(None, 'phrase_atom', A)

The function call below returns `None`:

In [25]:
D.get(1)

As does this:

In [26]:
D.get(1, 'lex')

And even this:

In [27]:
D.ahead(1)

`D` is essentially a soulless void that consumes whatever you throw at it and gives nothing in return.
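
This is an instance of the null-object pattern. A minimal sketch of the idea follows. The class name `NullObject` is hypothetical, and the real `Dummy` in `positions.py` may be built differently; in particular, I only model method calls here.

```python
class NullObject:
    """Hypothetical sketch of a Dummy-like object: accepts anything, yields None."""

    def __init__(self, *args, **kwargs):
        pass  # swallow whatever arguments are supplied

    def __bool__(self):
        return False  # behaves as False in conditions

    def __getattr__(self, name):
        # any method call (get, ahead, back, ...) returns None
        return lambda *args, **kwargs: None


null = NullObject(None, 'phrase_atom')
print(null.get(1))         # None
print(null.get(1, 'lex'))  # None
print(null.ahead(1))       # None
print(bool(null))          # False
```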

For safe calls on a `Positions` or `Walker` object, create it through a function that returns a `Dummy` for null nodes:

In [28]:
def getPos(node, context, tf):
    """A function to get Positions safely."""
    if node:
        return PositionsTF(node, context, tf)
    else:
        return Dummy() # <- give dummy on empty node

So:

In [29]:
P = getPos(None, 'phrase_atom', A)
P.get(1)

Or:

In [30]:
P = getPos(1, 'phrase_atom', A)
P.get(1)

2

# Need for Semantic Data

The accurate processing of word connections depends on fuller semantic data than BHSA provides. Future semantic data could be stored in a similar way to word sets (`wsets`). 

For example, in the two phrases

> (Exod 25:39) ככר זהב טהור <br>
> (2 Sam 24:24) בכסף שקלים חמשׁים

we see that זהב and כסף, despite occupying different positions and combining with different words, both indicate a kind of "composed of" semantic concept: "round gold" (i.e. a round composed of gold) and "silver shekels" (shekels composed of silver). To process these kinds of links, we need a list of nouns that often function as "material." But this is only the beginning. Many other words will have specific semantic values that motivate their syntactic behavior. Such a scope lies outside the bounds of this author's current project on Hebrew time phrases.

## A Compromise: Time Phrases

Since constructing these semantic classes is very time-consuming, I want to start with a smaller set of cases. For now I will focus on parsing connections within time phrases, since these are the subject of my ongoing PhD project.

In [39]:
def disjoint(ph):
    """Isolate phrases with gaps."""
    words = L.d(ph,'word')
    # a gap exists wherever two successive word slots are not adjacent
    return any(b - a > 1 for a, b in zip(words, words[1:]))

In [40]:
alltimes = [
    ph for ph in F.otype.s('timephrase') 
]
    
timephrases = [ph for ph in alltimes if not disjoint(ph)]

print(f'{len(timephrases)} phrases ready')

4438 phrases ready
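
The gap test itself can be tried without Text-Fabric. The sketch below assumes only that a phrase is a sorted tuple of integer slot numbers; the function name `has_gap` and the second tuple are made up for illustration.

```python
def has_gap(slots):
    """Return True if any two successive slot numbers are not adjacent."""
    return any(b - a > 1 for a, b in zip(slots, slots[1:]))


contiguous = (189681, 189682, 189683, 189684, 189685, 189686)
gapped = (10, 11, 14)  # illustrative values only

print(has_gap(contiguous))  # False
print(has_gap(gapped))      # True
```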


## Search & Display Functions

The functions below allow for fast searching and displaying of queries using a `Construction` object, described in the next section.

In [41]:
from cx_analysis.search import SearchCX

In [42]:
cx_show = SearchCX(A)
pretty, prettyconds, showcx, search = (
    cx_show.pretty, cx_show.prettyconds, 
    cx_show.showcx, cx_show.search
)

## Construction Classes

* `Construction` - an object that represents a linguistic construction; the class records roles and the words that occupy them, and provides methods for accessing and retrieving data on embedded roles and other constructions
* `CXbuilder` - matches conditions to build `Construction` objects and populates them with the requisite data
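
As a rough picture of what "matching conditions" means, consider the sketch below. It is an assumption on my part, inferred from the way specification dictionaries with a `conds` entry are passed to the builder later in this notebook; the function `first_match` is hypothetical and the real `CXbuilder.test` may behave differently.

```python
def first_match(*specs):
    """Return the first spec whose conditions are all true, else None."""
    for spec in specs:
        if all(spec['conds'].values()):
            return spec
    return None


word = {'lex': 'JWM/', 'sp': 'subs'}  # toy word record, not a TF node

match = first_match(
    {'name': 'prep', 'conds': {'sp == prep': word['sp'] == 'prep'}},
    {'name': 'cont', 'conds': {'sp == subs': word['sp'] == 'subs'}},
)
print(match['name'])  # cont
```

Keeping each condition under a human-readable key is what makes it possible to print a report of which conditions passed or failed.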

In [43]:
from cx_analysis.cx import Construction
from cx_analysis.build import CXbuilder, CXbuilderTF

# Word Constructions

The `Words` builder class recognizes word semantic classes and types based on provided criteria.

In [45]:
from word_grammar import Words

<hr>

# Subphrase Constructions

The `Subphrases` class prepares subphrase constructions.

In [56]:
from phrase_grammar import Subphrases

### Load Constructions

In [57]:
words = Words(A) # word CX builder

# analyze all matches; return as dict
start = datetime.now()
print(f'Beginning word construction analysis...')
wordcxs = words.cxdict(
    s for tp in timephrases
        for s in L.d(tp,'word')
)
print(f'\t{datetime.now() - start} COMPLETE \t[ {len(wordcxs)} ] words loaded')

Beginning word construction analysis...
	0:00:07.322392 COMPLETE 	[ 13579 ] words loaded


In [58]:
# time phrase CX builder
spc = Subphrases(wordcxs, semdist, A)

# TO-FIX

* missed appo 361457 cx: 1450112 (בחים מספר ימי חיי הבלו)

### Small Tests

In [59]:
# pretty(1448320)

In [60]:
# test_small = spc.appo_name(202679)
# showcx(test_small, conds=True)

### Stretch Tests

In [61]:
# On deck: adjectival preposition
# check performance: 1448556

test = spc.analyzestretch(L.d(1450075, 'word'), debug=False)

# for res in test:
#     showcx(res, conds=True)

### Pattern Searches

In [62]:
# words = [w for ph in timephrases for w in L.d(ph, 'word')]

# results = search(words, spc.appo_name, pattern='entity_name', show=100, shuffle=False)


### Analyze Results

In [63]:
# for res in results:
#     head, appo = list(res.getsuccroles('head'))[-1], list(res.getsuccroles('appo'))[-1]
#     hlex, alex = F.lex.v(int(head)), F.lex.v(int(appo))
    
#     showcx(res)
#     print()
#     print(f'lexs: {hlex} x {alex}')
#     print(f'dist: {semdist[hlex][alex]}')
#     print()

### Stretch Tests on Results

In [64]:
# elements = sorted(set(L.u(res.element, 'timephrase')[0] for res in results))

# for el in elements:
    
#     stretch = L.d(el, 'word')
#     test = spc.analyzestretch(stretch)
    
#     for res in test:
#         showcx(res)

### Testing on Random Phrases

In [65]:
# shuff = [k for k in timephrases
#             if len(L.d(k,'word')) > 4]
# random.shuffle(shuff)

In [66]:
# for phrase in shuff[:25]:
    
#     print('analyzing', phrase)
#     elements = L.d(phrase,'word')
    
#     try:
#         cxs = spc.analyzestretch(elements)
#         if cxs:
#             for cx in cxs:
#                 showcx(cx, refslots=elements)
#         else:
#             showcx(Construction(), refslots=elements)
    
#     except Exception:
#         sys.stderr.write(f'\nFAIL...running with debug...\n')
#         pretty(phrase)
#         spc.analyzestretch(elements, debug=True)
#         raise Exception('...debug complete...')

### Testing on All Timephrases

In [67]:
phrase2cxs = collections.defaultdict(list)
nocxs = []

# time it
start = datetime.now()

print(f'{datetime.now()-start} beginning subphrase analysis...')

for i, phrase in enumerate(timephrases):
     
    # analyze all known relas
    elements = L.d(phrase,'word')
    
    # analyze with debug exceptions
    try:
        cxs = spc.analyzestretch(elements)
    except Exception:
        # rerun the failing phrase with debug output, then stop
        sys.stderr.write('\nFAIL...running with debug...\n')
        pretty(phrase)
        spc.analyzestretch(elements, debug=True)
        raise Exception('...debug complete...')

    # save those phrases that have no matching constructions
    if not cxs:
        nocxs.append(phrase)
    else:
        phrase2cxs[phrase] = cxs
        
    # report status
    if i % 500 == 0 and i:
        print(f'\t{datetime.now()-start}\tdone with iter {i}/{len(timephrases)}')
        
print(f'{datetime.now()-start}\tCOMPLETE')
print('-'*20)
print(f'{len(phrase2cxs)} phrases matched with Constructions...')
print(f'{len(nocxs)} phrases not yet matched with Constructions...')

0:00:00.000046 beginning subphrase analysis...
	0:00:12.153777	done with iter 500/4438
	0:00:24.698732	done with iter 1000/4438
	0:00:35.234626	done with iter 1500/4438
	0:00:44.588046	done with iter 2000/4438
	0:00:58.892290	done with iter 2500/4438
	0:01:12.228788	done with iter 3000/4438
	0:01:24.128717	done with iter 3500/4438
	0:01:35.129935	done with iter 4000/4438
0:01:48.819077	COMPLETE
--------------------
4438 phrases matched with Constructions...
0 phrases not yet matched with Constructions...


## Closing Gaps

### Identify Gaps

Find timephrases that contain uncovered words besides waw conjunctions.

In [48]:
# gapped = []
# tested = []

# for ph, cxs in phrase2cxs.items():
    
#     tested.append(ph)
    
#     ph_slots = set(
#         s for s in L.d(ph,'word')
#     )
#     cx_slots = set(
#         s for cx in cxs
#             for s in cx.slots
#     )
    
#     if ph_slots.difference(cx_slots):
#         gapped.append(cxs)
        
# print(f'{len(gapped)} gapped phrases logged...')

In [49]:
# for gp in gapped[:25]:
#     for cx in gp:
#         showcx(cx)
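
The core of the gap check is a set difference between the slots of the phrase and the slots covered by its constructions. A small sketch with made-up slot numbers follows; the function name `uncovered` and the `ignore` parameter (for example, slots of waw conjunctions) are my own additions.

```python
def uncovered(phrase_slots, cx_slot_groups, ignore=()):
    """Return phrase slots that no construction covers, minus ignored slots."""
    covered = {s for group in cx_slot_groups for s in group}
    return sorted(set(phrase_slots) - covered - set(ignore))


phrase_slots = (1, 2, 3, 4, 5)
cx_slot_groups = [(1, 2), (5,)]

print(uncovered(phrase_slots, cx_slot_groups))              # [3, 4]
print(uncovered(phrase_slots, cx_slot_groups, ignore={3}))  # [4]
```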

<hr>

# Phrase Constructions

Developing a CXbuilder to connect all constructions in a complete phrase.


## Ambiguity with Coordinate CXs

Considerable ambiguity is present in several coordinate constructions:

**`A B and C`**<br>
Given that A, B, and C are nominal words, is their relationship `A // B // C` or `A+B // C`? In other words: **what is the relationship of two adjacent nominal words in a list?** Is B a descriptor of A or is it an independent element?

**`A of B and C`**<br>
Is it `(A of B) // (C)` or `(A of (B // C))`?

Or even:

**`A of B C and D`**<br>
This pattern combines elements from both ambiguous cases.

### Method

To address these ambiguities we will apply a battery of disambiguation attempts. At the core of these attempts is a [Semantic Vector Space](https://en.wikipedia.org/wiki/Vector_space_model), which is able to quantify the semantic distance between two words based on their contextual uses throughout the Hebrew Bible.

The working hypothesis of this method is
> Words in coordination with each other will be more semantically similar (i.e. the least distance in the vector space) than other candidates in the phrase.

Semantic similarity in a vector space is not the only criterion, however. Another aspect of semantic closeness is phrase structure. For instance, candidates are ranked first on whether their construction name matches that of the coordinated element, and only then on semantic distance.
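
The working hypothesis can be sketched as a simple "closest head wins" rule. The nested-dictionary layout mirrors how `semdist` is indexed in this notebook (`semdist[lex1][lex2]`), and the two distances are the rounded values reported in one of the test outputs further down; the helper `closest` is hypothetical.

```python
import math

# rounded distances from the head lexeme <RPL/ to two candidate heads
semdist_demo = {'<RPL/': {'<NN/': 0.7, 'JWM/': 1.08}}


def closest(head, candidates, dists):
    """Pick the candidate lexeme with the least distance to `head`."""
    row = dists.get(head, {})
    # unknown pairs count as infinitely distant
    return min(candidates, key=lambda c: row.get(c, math.inf), default=None)


print(closest('<RPL/', ['JWM/', '<NN/', 'B'], semdist_demo))  # <NN/
```

Treating missing pairs as infinitely distant keeps function words such as prepositions from ever winning on semantic grounds.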

In [110]:
import cx_analysis.graph_nav as nav

In [112]:
class Phrases(CXbuilder):
    """Build complete phrase constructions."""
    
    def __init__(self, phrase2cxs, semdist, tf):
        CXbuilder.__init__(self)
        
        # set up tf methods
        self.tf = tf
        self.F, self.T, self.L = tf.api.F, tf.api.T, tf.api.L
        
        # map cx to phrase node for context retrieval
        self.cx2phrase = {
            cx:ph 
                for ph in phrase2cxs
                    for cx in phrase2cxs[ph]
        }
        
        self.phrase2cxs = phrase2cxs
        self.semdists = semdist
        
        self.cxs = (        
            self.appo,
            self.coord,
        )
        self.dripbucket = (
            self.cxph,
        )
        
        self.kind = 'phrase'
        
    def cxph(self, cx):
        """Dripbucket function that returns cx as is."""
        return cx
        
    def get_context(self, cx):
        """Get context for a given cx."""
        phrase = self.cx2phrase.get(cx, None)
        if phrase:
            return self.phrase2cxs[phrase]
        else:
            return tuple()
        
    def getP(self, cx):
        """Index positions on phrase context"""
        positions = self.get_context(cx)
        if positions:
            return Positions(
                cx, positions, default=Construction()
            ).get
        else:
            return Dummy

    def getWk(self, cx):
        """Index walks on phrase context"""
        positions = self.get_context(cx)
        if positions:
            return Walker(cx, positions)
        else:
            return Dummy()
    
    def getindex(
        self, indexable, index, 
        default=Construction()
    ):
        """Safe index on iterables w/out IndexErrors."""
        try:
            return indexable[index]
        except:
            return default
    
    def getname(self, cx):
        """Get a cx name"""
        return cx.name
    
    def getkind(self, cx):
        """Get a cx kind."""
        return cx.kind
    
    def getsuccrole(self, cx, role, index=-1):
        """Get a cx role from a list of successive roles.
        
        e.g.
        [big_head, medium_head, small_head][-1] == small_head
        """
        cands = list(cx.getsuccroles(role))
        try:
            return cands[index]
        except IndexError:
            return Construction()
    
    def string_plus(self, cx, plus=1):
        """Stringifies a CX + N-slots for Levenshtein tests."""
        
        # get all slots in the context for plussing
        allslots = sorted(set(
            s for scx in self.get_context(cx)
                for s in scx.slots
        ))
        
        # get plus slots
        P = (Positions(self.getindex(cx.slots, -1), allslots).get
                 if cx.slots and allslots else Dummy)
        plusses = []
        for i in range(1, plus+1):
            plusses.append(P(i,-1)) # -1 for null slots (== empty string in T.text)
        plusses = [p for p in plusses if type(p) == int]
        
        # format the text string for Levenshtein testing
        ptxt = self.T.text(
            cx.slots + tuple(plusses),
            fmt='text-orig-plain'
        ) if cx.slots else ''
        
        return ptxt

    def rank_candidates(self, cx, cx_patterns=()):
        """Ranks preceding phrases on likelihood of a relationship
        
        TODO: Give a thorough explanation
        """
        
        # standard features and positional navigation
        F, T = self.F, self.T
        P = self.getP(cx)
        semdist = self.semdists
        Wk = self.getWk(cx)             
            
        # first we need to collect candidates
        # there are two possibilities:
        #    1. non-embedded candidates (i.e. have no other relations)
        #    2. embedded candidates (i.e. already part of another subphrase)
        # we give first preference to top level candidates as this seems 
        # to produce more accurate results
        
        # 1. get all top-level cxs behind this one that match in name
        cx_behinds = Wk.back(
            lambda c: c.name == cx.name,
            every=True,
            stop=lambda c: (
                c.name == 'conj' and (c != P(-1))
            )
        )
        
        # 2. if top level phrases produce no results,
        #       look for embedded candidates
        if not cx_behinds:
            topcontext = self.get_context(cx)
            
            # gather all valid embedded candidates
            subcontext = []
            for topcx in topcontext:
                for subcx in topcx.subgraph():
                    if type(subcx) == int: # skip TF slots
                        continue
                    if (
                        subcx in topcontext
                        or (subcx.name != 'conj' and subcx not in cx)
                    ):
                        subcontext.append(subcx)        
            
            # walk the embedded candidates
            # and collect those that are valid
            Wk2 = Walker(cx, subcontext)
            cx_behinds = Wk2.back(
                lambda c: c.name != 'conj', 
                default=[P(-2)],
                every=True,
                stop=lambda c: (
                    c.name == 'conj' and (c != P(-1))
                )
            )
        
        # Now we apply a series of additional filters on the candidates:
        
        # map each candidate to its last slot to make sure
        # every one is the last item in its phrase
        # (check is made in next series of lines)
        cx2last = {
            cxb:self.getindex(sorted(cxb.slots), -1, 0)
                for cxb in cx_behinds
        }
        
        # find coordinate candidate subphrases that stand
        # at the end of the phrase
        cx_subphrases = []
        for cx_back in cx_behinds:
            for cxsp in cx_back.subgraph():
                if type(cxsp) == int:
                    continue
                elif (
                    cx2last[cx_back] in cxsp.slots # check last slot
                    and cxsp.getrole('head')
                ):
                    cx_subphrases.append(cxsp)
        
        # get subphrase heads for semantic tests
        cx2heads = [
            (cxsp, self.getsuccrole(cxsp,'head'))
                for cxsp in cx_behinds
        ]

        # get head of this cx
        head1 = self.getsuccrole(cx,'head')     
        head1lex = F.lex.v(head1)
        
        # sort on a set of priorities
        # the default sort behavior is used (least to greatest)
        # thus when a bigger value should be more important, 
        # a negative is added to the number
        stringp = self.string_plus
        
        # arrange candidates by priority
        cxpriority = []
        for cxsp, headsp in cx2heads:
            name_eq = 0 if cxsp.name == cx.name else 1
            semantic_dist = semdist.get(
                head1lex,{}
            ).get(F.lex.v(headsp), np.inf)
            size = -len(cxsp.slots)
            levenshtein = lev_dist(stringp(cx), stringp(cxsp))
            slot_dist = -next(iter(cxsp.slots), 0)
            heads = (head1, headsp) # for reporting purposes only
            
            cxpriority.append((
                name_eq,
                semantic_dist,
                size,
                levenshtein,
                slot_dist,
                heads,
                cxsp
            ))
            
        # make the sorting
        candidates = sorted(cxpriority, key=lambda k: k[:-1])
        
        # select the first priority candidate
        cand = next(iter(candidates), (0,0,Construction()))
        
        # add data for conds report / debugging
        stats = collections.defaultdict(str)
        for namescore,sdist,leng,ldist,lslot,heads,cxp in candidates:
            # name equality
            stats['namescore'] += f'\n\t{cxp} namescore: {namescore}'
            # semantic distance
            stats['semdists'] += (
                f'\n\t{round(sdist, 2)}, {F.lex.v(heads[0])} ~ {F.lex.v(heads[1])}, {cxp}'
            )
            # size of cx
            stats['size'] += f'\n\t{cxp} length: {abs(leng)}'
            
            # Levenshtein distance
            stats['ldist'] += f'\n\t{cxp} dist: {ldist}'
            
            # dist of last slot
            stats['lslot'] += f'\n\t{cxp} last slot: {abs(lslot)}'
    
        return (candidates, cand, stats)
    
    def coord(self, cx):
        """A coordinate construction.
        
        In order to match a coordinate cx, we need to determine
        which item in the previous phrase this cx belongs with. 
        This is done using a semantic vector space, which can
        quantify the approximate semantic distance between the
        heads of this cx and a candidate cx.
        
        Criteria utilized in validating a coordinate cx between
        an origin cx and a candidate cx are the following:
            TODO: fill in
        """
        
        P = self.getP(cx)        
        cands, cand, stats = self.rank_candidates(cx)
        
        return self.test(
            {
                'element': cx,
                'name': 'coord',
                'kind': self.kind,
                'roles': {'part2':cx, 'conj': P(-1), 'part1': cand[-1]},
                'conds': {
                    'P(-1).name == conj':
                        P(-1).name == 'conj',
                    'bool(cand)':
                        bool(cand[-1]),
                    f'name matches {stats["namescore"]}\n':
                        bool(cands),
                    f'is shortest sem. distance of {stats["semdists"]}\n':
                        bool(cands),
                    f'is longest length of: {stats["size"]}\n':
                        bool(cands),
                    f'is shortest Levenshtein distance: {stats["ldist"]}\n':
                        bool(cands),
                    f'is closest last slot of: {stats["lslot"]}\n':
                        bool(cands)
                }
            },
        )
    
    def L_anchor(self, cx):
        """Find L anchor CXs"""
        P = self.getP(cx)
        prep = nav.get_role(cx, 'prep', default=Construction())
        prep = next(iter(prep.slots), 0)
        prep_lex = self.F.lex.v(prep)
        return self.test(
            {
                'element': cx,
                'name': 'L_anchor',
                'kind': self.kind,
                'roles': {'anchor': cx, 'head': P(-1)},
                'conds': {
                    'prep_lex == L':
                        prep_lex == 'L',
                    'bool(P(-1))':
                        bool(P(-1)),
                },
            },
        )   
    
    def appo(self, cx):
        """Find appositional cxs"""
        
        P = self.getP(cx)
        cands, cand, stats = self.rank_candidates(cx)
                
        return self.test(
            {
                'element': cx,
                'name': 'appo',
                'pattern': 'NP',
                'kind': self.kind,
                'roles': {'appo':cx, 'head': cand[-1]},
                'conds': {
                    'name(cx) not in not_NPset':
                        cx.name not in {'prep_ph','conj'},
                    'P(-1).name != conj':
                        P(-1).name != 'conj',
                    'bool(cand)':
                        bool(cand[-1]),
                    f'name matches {stats["namescore"]}\n':
                        bool(cands),
                    f'is shortest sem. distance of {stats["semdists"]}\n':
                        bool(cands),
                    f'is longest length of: {stats["size"]}\n':
                        bool(cands),
                    f'is shortest Levenshtein distance: {stats["ldist"]}\n':
                        bool(cands),
                    f'is closest last slot of: {stats["lslot"]}\n':
                        bool(cands),
                }
            },
            {
                'element': cx,
                'name': 'appo',
                'pattern': 'PP',
                'kind': self.kind,
                'roles': {'appo':cx, 'head': cand[-1]},
                'conds': {
                    'name(cx) == prep_ph':
                        cx.name == 'prep_ph',
                    'P(-1).name != conj':
                        P(-1).name != 'conj',
                    'bool(cand)':
                        bool(cand[-1]),
                    f'name matches {stats["namescore"]}\n':
                        bool(cands),
                    f'is shortest sem. distance of {stats["semdists"]}\n':
                        bool(cands),
                    f'is longest length of: {stats["size"]}\n':
                        bool(cands),
                    f'is shortest Levenshtein distance: {stats["ldist"]}\n':
                        bool(cands),
                    f'is closest last slot of: {stats["lslot"]}\n':
                        bool(cands)
                }
            }        
        )
    
    
    def adjacent(self, cx):
        """Find adjacent CXs"""
        
        P = self.getP(cx)
        
        return self.test(
            {
                'element': cx,
                'name': 'appo',
                'kind': self.kind,
                'roles': {'head':cx, 'appo':P(1)},
                'conds': {
                    'cx.name != conj':
                        cx.name != 'conj',
                    'P(1).name != prep':
                        P(1).name != 'prep',
                    'bool(P(1))':
                        bool(P(1)),
                    f'name({P(1).name}) not in (conj, prep_ph)':
                        P(1).name not in {'conj','prep_ph'},
                }
            }
        )


In [113]:
cxp = Phrases(phrase2cxs, semdist, A)
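
The ranking in `rank_candidates` relies on Python's tuple comparison: candidates are sorted field by field, so earlier fields take priority over later ones, and values where bigger is better are negated. The sketch below is a simplification that leaves out the slot-distance and head fields; the numbers are copied from one of the test reports in the next section.

```python
candidates = [
    # (name_eq, semantic_dist, -size, levenshtein, label)
    (1, 1.39, -2, 6, 'geni_ph (363635, 363636)'),
    (0, 1.39, -1, 5, 'cont (363635,)'),
    (0, 0.68, -1, 4, 'cont (363636,)'),
]

# sort on everything except the label, least to greatest
ranked = sorted(candidates, key=lambda k: k[:-1])
best = ranked[0][-1]

print(best)  # cont (363636,)
```

Because `name_eq` comes first, a candidate with a matching name always outranks one without, whatever their semantic distances.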

## Tests

In [71]:
# A.show(A.search('''

# timephrase
#     word pdp=subs ls#card|prpe lex#KL/|JWM/ st=a

#     <: word lex=JWM/
# ''')[:10])

In [73]:
# the following phrases contain cases that still
# need to be fixed for the coordinate cx; some should
# actually be done in the previous cx builder at subphrase level

to_fix = [
    1450039, # coord, add adjacent advb cx with JWM
    1450647, # coord, consider prioritizing Levenshtein over size
    
]

### Test Small

In [100]:
test = cxp.coord(phrase2cxs[1449813][-1])

showcx(test, conds=True) 

{   '__cx__': 'coord',
    'conj': {'__cx__': 'conj', 'head': 281785},
    'part1': {'__cx__': 'cont', 'head': 281784},
    'part2': {'__cx__': 'cont', 'head': 281786}}

-- CX coord (281784, 281785, 281786) --
pattern: coord
P(-1).name == conj                                       True
bool(cand)                                               True
name matches 
	CX cont (281784,) namescore: 0
	CX cont (281783,) namescore: 0
	CX prep_ph (281782, 281783, 281784) namescore: 1
	CX geni_ph (281783, 281784) namescore: 1
	CX prep (281782,) namescore: 1
                           True
is shortest sem. distance of 
	0.7, <RPL/ ~ <NN/, CX cont (281784,)
	1.08, <RPL/ ~ JWM/, CX cont (281783,)
	1.08, <RPL/ ~ JWM/, CX prep_ph (281782, 281783, 281784)
	1.08, <RPL/ ~ JWM/, CX geni_ph (281783, 281784)
	inf, <RPL/ ~ B, CX prep (281782,)
                           True
is longest length of: 
	CX cont (281784,) length: 1
	CX cont (281783,) length: 1
	CX prep_ph (281782, 281783, 281784) length: 3
	CX geni_

In [103]:
F.freq_lex.v(363638)

3

In [104]:
L.u(363638,'lex')

(1443373,)

In [102]:
test = cxp.coord(phrase2cxs[1450668][-1])

showcx(test, conds=True) 

{   '__cx__': 'coord',
    'conj': {'__cx__': 'conj', 'head': 363637},
    'part1': {'__cx__': 'cont', 'head': 363636},
    'part2': {'__cx__': 'cont', 'head': 363638}}

-- CX coord (363636, 363637, 363638) --
pattern: coord
P(-1).name == conj                                       True
bool(cand)                                               True
name matches 
	CX cont (363636,) namescore: 0
	CX cont (363635,) namescore: 0
	CX geni_ph (363635, 363636) namescore: 1
                           True
is shortest sem. distance of 
	0.68, MRWD/ ~ <NJ=/, CX cont (363636,)
	1.39, MRWD/ ~ JWM/, CX cont (363635,)
	1.39, MRWD/ ~ JWM/, CX geni_ph (363635, 363636)
                           True
is longest length of: 
	CX cont (363636,) length: 1
	CX cont (363635,) length: 1
	CX geni_ph (363635, 363636) length: 2
                           True
is shortest Levenshtein distance: 
	CX cont (363636,) dist: 4
	CX cont (363635,) dist: 5
	CX geni_ph (363635, 363636) dist: 6
                           True

In [107]:
L.u(862564,'timephrase')

(1450558,)

In [108]:
test = cxp.coord(phrase2cxs[1450558][-1])

showcx(test, conds=True) 

{   '__cx__': 'coord',
    'conj': {'__cx__': 'conj', 'head': 348667},
    'part1': {'__cx__': 'cont', 'head': 348663},
    'part2': {'__cx__': 'cont', 'head': 348668}}

-- CX coord (348663, 348667, 348668) --
pattern: coord
P(-1).name == conj                                       True
bool(cand)                                               True
name matches 
	CX cont (348663,) namescore: 0
                           True
is shortest sem. distance of 
	0.68, >PLH/ ~ JWM/, CX cont (348663,)
                           True
is longest length of: 
	CX cont (348663,) length: 1
                           True
is shortest Levenshtein distance: 
	CX cont (348663,) dist: 6
                           True
is closest last slot of: 
	CX cont (348663,) last slot: 348663
                           True

-- CX cont (348668,) --
pattern: cont
bool(F.pdp.v(348668))                                    True

-- CX conj (348667,) --
pattern: conj
bool(F.pdp.v(348667))                                    Tr

### Stretch Test

In [98]:
testph = phrase2cxs[1446841]

In [100]:
# test = cxp.analyzestretch(
#     testph, 
#     duplicate=True,
#     debug=True)

# for res in test:
#     showcx(res, conds=True)

<hr>

# TOFIX:
* fix apposition - 1447545 (צען מצרים)

# TOTEST: 

1450333 - from apposition to proper name

<hr>

Print total number of phrases left to parse:

In [85]:
print(
    len([cx_tuple for cx_tuple in phrase2cxs.values() if len(cx_tuple) > 1])
)

565


In [78]:
def filt_gaps(cx):
    """Isolate cxs with gaps"""
    timephrase = L.u(next(iter(cx.slots)),'phrase')[0]
    # cast slots to a set so the difference works on any iterable of slots
    return bool(set(L.d(timephrase,'word')) - set(cx.slots))
    
def filt(cx):
    """Find specific lexeme"""
    timephrase = L.u(next(iter(cx.slots)),'phrase')[0]
    phrasewords = L.d(timephrase, 'word')
    return (
        {'JWM/', 'LJLH/'}.issubset(F.lex.v(w) for w in phrasewords)
        and len(phrasewords) == 3
    )

In [115]:
# elements = [
#     cx for ph in list(phrase2cxs.values())
#         for cx in ph
# ]

# results = search(
#     elements, 
#     cxp.L_anchor, 
#     pattern='',
#     shuffle=False,
#     #select=lambda c: filt(c),
#     extraFeatures='lex st',
#     show=100
# )

## Stretch Tests

Testing across a whole phrase.

In [53]:
# test = cxp.analyzestretch(phrase2cxs[1449168], debug=True)
# for res in test:
#     showcx(res, conds=False)