# Composing Time Constructions

In this notebook we construct a rudimentary parser that identifies the various roles within Hebrew time phrases.

In [1]:
import sys
import collections
import pickle
import random
import re
import copy
import numpy as np
import networkx as nx
from datetime import datetime
import matplotlib.pyplot as plt
from Levenshtein import distance as lev_dist
from pprint import pprint

# local packages
from tf_tools.load import load_tf

# load semantic vectors
from locations import semvector
with open(semvector, 'rb') as infile: 
    semdist = pickle.load(infile)

# load and configure Text-Fabric
TF, api, A = load_tf()
F, E, T, L = api.F, api.E, api.T, api.L
A.displaySetup(condenseType='phrase', withNodes=True, extraFeatures='st')

This is Text-Fabric 7.8.12
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

119 features found and 6 ignored
  0.00s loading features ...
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
  5.03s All features loaded/computed - for details use loadLog()


# Machinery

We could use some machinery to do the hard work of looking in and around a node. In the older approach we used TF search templates. But these are not very efficient at scale, and they are always bound by the limits of the query language. I take another approach here: a set of classes that specify locations and directions within a specified context.

In [2]:
from positions import Positions, PositionsTF, Walker, Dummy

## `Positions(TF)`

The `Positions` class enables concise access to adjacent nodes within a given context. This allows us to write algorithms with query-like concision while keeping all of the power of Python.

This class is instantiated on a word node and can provide contextual look-up data for a given word. For example, given a phrase containing the following word nodes:

> (189681, 189682, **189683**, 189684, 189685, 189686) <br>

representing the following phrase (space separated for clarity):

> ב שׁנת **שׁלשׁים** ו שׁמנה שׁנה

Given that the bolded node, `189683`, is our `source` word, we instantiate the class, feeding in the node, the "phrase_atom" string (the context we want to search within), and an instance of Text-Fabric (`tf`):

In [3]:
      #    source node    context  TF instance  
      #         |            |       |
P = PositionsTF(189683, 'phrase_atom', A).get

If we want to obtain the word adjacent one space forward, we simply ask `P` for `1`, which gives us the next word in the phrase.

In [4]:
P(1)

189684

If we try to ask for 4 words forward, we go beyond the bounds of the phrase. But `P` handles this by returning nothing:

In [5]:
P(4)

To look back one word, we simply give a negative value:

In [6]:
P(-1)

189682

Finally, `P` can be used to quickly call features on these words. For instance, to get the lexeme of the word two positions ahead of `189683`:

In [7]:
P(2,'lex')

'CMNH/'

To get several features at once, we simply add them as further arguments. The result is a set of feature values:

In [8]:
P(2, 'lex', 'nu')

{'CMNH/', 'sg'}

`P` can also handle features on the source node itself by giving a positionality of `0`:

In [9]:
P(0, 'lex')

'CLC/'

### `Positions` also exists in a non-TF version

When the non-TF version of `Positions` is given any ordered iterable, it can perform the same functions.

In [10]:
test_ps = ['The', 'good', 'dog', 'jumped.']

P = Positions('good', test_ps).get

In [11]:
P(1)

'dog'

`Positions` can apply a function to the result via the optional `do` argument. In the example below, the word two positions ahead is found and an upper-case function is called on the string.

In [12]:
P(2, do=lambda w: w.upper())

'JUMPED.'

The non-TF version of `Positions` makes it possible to do positionality searches with any ordered list of Python objects that represent linguistic units.
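To make the look-up logic concrete, here is a minimal sketch of how such a class could work on a plain list. `MiniPositions` is a hypothetical stand-in written for this illustration; it is not the code in `positions`, and it leaves out feature look-ups entirely.

```python
class MiniPositions:
    """Hypothetical stand-in for a Positions-like class (not the real one)."""

    def __init__(self, source, context):
        self.context = list(context)
        self.index = self.context.index(source)

    def get(self, offset, do=None):
        """Return the element `offset` steps from the source, or None."""
        target = self.index + offset
        # refuse to step outside the context; this also prevents
        # negative indices from wrapping around to the end of the list
        if not 0 <= target < len(self.context):
            return None
        element = self.context[target]
        return do(element) if do else element


P_demo = MiniPositions('good', ['The', 'good', 'dog', 'jumped.']).get
print(P_demo(1))                          # dog
print(P_demo(-1))                         # The
print(P_demo(3))                          # None (out of bounds)
print(P_demo(2, do=lambda w: w.upper()))  # JUMPED.
```

The explicit bounds check matters: without it, a request such as `-2` would silently wrap around to the last word of the list instead of returning nothing.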

## `Walker`

`Walker` performs a similar function to `Positions`, except that it is agnostic about exact positions: it walks either `ahead` or `back` from the source until it reaches a target node in the context. A function must be supplied that returns `True` on the target node.

We instantiate the `Walker` using the same source and context as above.

In [13]:
source = 189683
# get words inside source's phrase_atom
positions = L.d(
    L.u(189683,'phrase_atom')[0], 'word'
)

Wk = Walker(source, positions)

`Walker` is demonstrated below with the same word. A simple `lambda` function is used to test the lexical set (`ls`) of each word. In the example below, we find the first word ahead of `189683` that is a cardinal number:

In [14]:
Wk.ahead(lambda w: F.ls.v(w) == 'card')

189685

If no valid match is found along the way, `None` is returned:

In [15]:
Wk.ahead(lambda w: F.ls.v(w) == 'BOOGABOOGA')

Another example wherein we walk backwards to the preposition:

In [16]:
Wk.back(lambda w: F.sp.v(w) == 'prep')

189681

We can also specify that the walk should be interrupted under certain conditions with a `stop` function. In this case we walk forward to the next cardinal number, but the walk is interrupted when the `stop` function detects a conjunction.

In [17]:
Wk.ahead(lambda w: F.ls.v(w) == 'card',
         stop=lambda w: F.sp.v(w) == 'conj')

We can also specify the opposite with a `go` function argument, which defines the nodes that are allowed to intervene between `source` and `target`. Below we specify that *only* a conjunction may intervene.

In [18]:
Wk.ahead(lambda w: F.ls.v(w) == 'card',
         go=lambda w: F.sp.v(w) == 'conj')

189685

The `go` and `stop` functions can be as permissive or strict as desired.

Finally, we can tell `Walker` that the output of the validation function should be returned instead of the node itself with the optional argument `output=True`:

In [19]:
val_funct = lambda w: F.ls.v(w) if F.ls.v(w)=='card' else None

Wk.ahead(val_funct, output=True)

'card'

This ability is useful for certain tests.

Like `Positions`, `Walker` can be used in non-TF contexts:

In [20]:
test_ps = ['The', 'bad', 'cat', 'swatted.']

Wk_notf = Walker('bad', test_ps)

In [21]:
Wk_notf.ahead(lambda w: w.startswith('sw'))

'swatted.'

### Returning All Results along Path

`Walker` can also return all results along the path by toggling `every=True`:

In [22]:
Wk_notf.ahead(lambda w: type(w)==str, every=True)

['cat', 'swatted.']
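The walking behaviour described in this section can be sketched in a few lines. `MiniWalker` below is a hypothetical re-implementation for illustration only; in particular, the order in which the validation, `stop`, and `go` functions are checked is my assumption and may differ from the real `Walker`. Words are modelled as `(form, part of speech)` tuples.

```python
class MiniWalker:
    """Hypothetical stand-in for a Walker-like class (not the real one)."""

    def __init__(self, source, context):
        context = list(context)
        i = context.index(source)
        self.path_ahead = context[i + 1:]
        self.path_back = context[:i][::-1]  # nearest element first

    def _walk(self, path, validate, stop=None, go=None,
              output=False, every=False):
        found = []
        for element in path:
            result = validate(element)
            if result:
                found.append(result if output else element)
                if not every:
                    break
            elif stop and stop(element):
                break  # an explicit blocker interrupts the walk
            elif go and not go(element):
                break  # only `go` elements may intervene
        if every:
            return found
        return found[0] if found else None

    def ahead(self, validate, **kwargs):
        return self._walk(self.path_ahead, validate, **kwargs)

    def back(self, validate, **kwargs):
        return self._walk(self.path_back, validate, **kwargs)


words = [('in', 'prep'), ('year', 'subs'), ('thirty', 'card'),
         ('and', 'conj'), ('eight', 'card'), ('year', 'subs')]
wk = MiniWalker(words[2], words)
is_card = lambda w: w[1] == 'card'
is_conj = lambda w: w[1] == 'conj'

print(wk.ahead(is_card))                # ('eight', 'card')
print(wk.ahead(is_card, stop=is_conj))  # None
print(wk.ahead(is_card, go=is_conj))    # ('eight', 'card')
print(wk.back(lambda w: w[1] == 'prep'))  # ('in', 'prep')
```

Storing the backward path nearest-first means that `ahead` and `back` can share a single walking routine.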

## `Dummy`

When writing conditions and logic, we want an object that passively receives `NoneType`s or zero `int`s without throwing errors. Such an object should also return `None` to reflect its `False` value. `Dummy` provides such functionality. `Dummy` can receive the same arguments, kwargs, and function calls as a `Positions` or `Walker` object. But it returns absolutely nothing. Ouch.

In [23]:
D = Dummy(None, 'phrase_atom', A)

The function call below returns `None`:

In [24]:
D.get(1)

As does this:

In [25]:
D.get(1, 'lex')

And even this:

In [26]:
D.ahead(1)

`D` is essentially a soulless void that consumes whatever you throw at it and gives nothing in return.

For safe calls on a `Positions` or `Walker` object, construct it through a function that returns a `Dummy` whenever the node is null:

In [27]:
def getPos(node, context, tf):
    """A function to get Positions safely."""
    if node:
        return PositionsTF(node, context, tf)
    else:
        return Dummy() # <- give dummy on empty node

So:

In [28]:
P = getPos(None, 'phrase_atom', A)
P.get(1)

Or:

In [29]:
P = getPos(1, 'phrase_atom', A)
P.get(1)

2
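The idea behind `Dummy` is the classic null-object pattern. A minimal sketch of such an object follows; `NullObject` is a hypothetical name, and both the use of `__getattr__` and the falsy `__bool__` are my assumptions about one way to get this behaviour, not a description of the real `Dummy`.

```python
class NullObject:
    """Hypothetical null object: accepts any call, always yields None."""

    def __init__(self, *args, **kwargs):
        pass  # swallow whatever constructor arguments are given

    def __getattr__(self, name):
        # leave special (dunder) lookups alone so copying etc. still works
        if name.startswith('__'):
            raise AttributeError(name)
        # any other attribute (get, ahead, back, ...) is a do-nothing method
        return lambda *args, **kwargs: None

    def __bool__(self):
        return False  # assumption: the object counts as false in conditions


null = NullObject(None, 'phrase_atom')
print(null.get(1))         # None
print(null.get(1, 'lex'))  # None
print(null.ahead(1))       # None
print(bool(null))          # False
```

Because every method call quietly returns `None`, chains of conditions can be written without first checking whether a neighbouring node exists.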

# Need for Semantic Data

The accurate processing of word connections depends on fuller semantic data than BHSA provides. Future semantic data could be stored in a similar way to word sets (`wsets`). 

For example, in the two phrases

> (Exod 25:39) ככר זהב טהור <br>
> (2 Sam 24:24) בכסף שקלים חמשׁים

we see that זהב and כסף, despite occupying different positions next to different words, each indicate a kind of "composed of" semantic concept: "round gold" (i.e. round composed of gold) and "silver shekels" (shekels composed of silver). To process these kinds of links, we need a list of nouns that often function as "material." But this is only the beginning. Many other words will have specific semantic values that motivate their syntactic behavior. Such a scope lies outside the bounds of this author's current project on Hebrew time phrases.
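As a rough illustration of what a "material" word set could buy us, the sketch below links a material noun to its neighbour regardless of which side it stands on. Everything here is hypothetical: the set contents, the `composed_of` label, and the use of English glosses in place of real lexemes are assumptions made for the example.

```python
# Hypothetical word set of "material" nouns (English glosses for readability)
MATERIAL_NOUNS = {'gold', 'silver'}


def material_links(nouns):
    """Link each material noun to an adjacent non-material noun."""
    links = []
    for left, right in zip(nouns, nouns[1:]):
        if right in MATERIAL_NOUNS and left not in MATERIAL_NOUNS:
            links.append((left, 'composed_of', right))
        elif left in MATERIAL_NOUNS and right not in MATERIAL_NOUNS:
            links.append((right, 'composed_of', left))
    return links


print(material_links(['round', 'gold']))      # material noun second
print(material_links(['silver', 'shekels']))  # material noun first
```

Both word orders yield the same kind of link, which is exactly what position-based rules alone cannot express.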

## A Compromise: Time Phrases

Since constructing these semantic classes is vastly time consuming, I will start with a smaller set of cases and focus for now on parsing connections within time phrases, which are the subject of my ongoing PhD project.

In [30]:
def disjoint(ph):
    """Isolate phrases with gaps."""
    words = L.d(ph, 'word')
    # a gap exists wherever two successive word slots are not adjacent
    return any(b - a > 1 for a, b in zip(words, words[1:]))

In [31]:
alltimes = [
    ph for ph in F.otype.s('timephrase') 
]
    
timephrases = [ph for ph in alltimes if not disjoint(ph)]

print(f'{len(timephrases)} phrases ready')

3864 phrases ready


## Search & Display Functions

The functions below allow for fast searching and displaying of queries using a `Construction` object, described in the next section.

In [32]:
from cx_analysis.search import SearchCX

In [33]:
cx_show = SearchCX(A)
pretty, prettyconds, showcx, search = (
    cx_show.pretty, cx_show.prettyconds, 
    cx_show.showcx, cx_show.search
)

## Construction Classes

* `Construction` - an object that represents a linguistic construction; the class records roles and the words that occupy them, and provides methods for accessing and retrieving data on embedded roles and other constructions
* `CXbuilder` - matches conditions to build `Construction` objects; populates them with requisite data

In [34]:
from cx_analysis.cx import Construction
from cx_analysis.build import CXbuilder, CXbuilderTF

## Word Constructions

The `Words` builder class recognizes word semantic classes and types based on provided criteria.

In [35]:
from cx_analysis.word_grammar import Words

## Subphrase Constructions

The `Subphrases` class prepares subphrase constructions.

In [36]:
from cx_analysis.phrase_grammar import Subphrases

### Load Constructions

In [37]:
words = Words(A) # word CX builder

# analyze all matches; return as dict
start = datetime.now()
print('Beginning word construction analysis...')
wordcxs = words.cxdict(
    s for tp in timephrases
        for s in L.d(tp,'word')
)
print(f'\t{datetime.now() - start} COMPLETE \t[ {len(wordcxs)} ] words loaded')

Beginning word construction analysis...
	0:00:07.750267 COMPLETE 	[ 12887 ] words loaded


In [38]:
# time phrase CX builder
spc = Subphrases(wordcxs, semdist, A)

## Pickle Tests

In [39]:
# def tuplify_graph(node):
#     """Convert a NetworkX constructional graph into a tuple"""
#     return tuple(
#         (n1, n2, {'role': node.graph[n1][n2]['role']})
#              for n1, n2 in nx.bfs_edges(node.graph, node)
#     )

# def graphify_tuple(graph_tuple):
#     """Convert a graph tuple back into NetworkX graph"""
#     return nx.DiGraph(graph_tuple)

# def package_graph(node, graph=None):
#     """Recursively turn all cxs and contained cxs into tuples"""
    
#     # map graph nodes to tuple
#     if graph is None:
#         new_graph = tuplify_graph(node)
#         # bequeath to all embedded nodes
#         for nn in node.graph:        
#             if type(nn) == Construction and type(nn.graph) != tuple:
#                 package_graph(nn, graph=new_graph)
#         node.graph = new_graph
    
#     # assign to tuple
#     else:
#         node.graph = graph
        
# def unpackage_graph(node, graph=None):
#     """Recursively turn all tupled graphs into NetworkX graphs"""
    
#     # map graph nodes to tuple
#     if graph is None:
#         node.graph = nx.DiGraph(node.graph)
#         # bequeath to all embedded nodes
#         for nn in node.graph:        
#             if type(nn) == Construction and type(nn.graph) == tuple:
#                 unpackage_graph(nn, graph=node.graph)
    
#     # assign to tuple
#     else:
#         node.graph = graph

# TO-FIX

* missed appo 361457 cx: 1450112 (בחים מספר ימי חיי הבלו)

### Small Tests

In [40]:
# pretty(1448320)

In [41]:
test_small = spc.appo_name(202679)
#showcx(test_small, conds=True)

### Stretch Tests

In [42]:
# On deck: adjectival preposition
# check performance: 1448556

test = spc.analyzestretch(L.d(1450075, 'word'), debug=False)

# for res in test:
#     showcx(res, conds=True)

### Pattern Searches

In [43]:
# words = [w for ph in timephrases for w in L.d(ph, 'word')]

# results = search(words, spc.appo_name, pattern='entity_name', show=100, shuffle=False)


### Analyze Results

In [44]:
# for res in results:
#     head, appo = list(res.getsuccroles('head'))[-1], list(res.getsuccroles('appo'))[-1]
#     hlex, alex = F.lex.v(int(head)), F.lex.v(int(appo))
    
#     showcx(res)
#     print()
#     print(f'lexs: {hlex} x {alex}')
#     print(f'dist: {semdist[hlex][alex]}')
#     print()

### Stretch Tests on Results

In [45]:
# elements = sorted(set(L.u(res.element, 'timephrase')[0] for res in results))

# for el in elements:
    
#     stretch = L.d(el, 'word')
#     test = spc.analyzestretch(stretch)
    
#     for res in test:
#         showcx(res)

### Testing on Random Phrases

In [46]:
# shuff = [k for k in timephrases
#             if len(L.d(k,'word')) > 4]
# random.shuffle(shuff)

In [47]:
# for phrase in shuff[:25]:
    
#     print('analyzing', phrase)
#     elements = L.d(phrase,'word')
    
#     try:
#         cxs = spc.analyzestretch(elements)
#         if cxs:
#             for cx in cxs:
#                 showcx(cx, refslots=elements)
#         else:
#             showcx(Construction(), refslots=elements)
    
#     except:
#         sys.stderr.write(f'\nFAIL...running with debug...\n')
#         pretty(phrase)
#         spc.analyzestretch(elements, debug=True)
#         raise Exception('...debug complete...')

### Testing on All Timephrases

In [48]:
phrase2cxs = collections.defaultdict(list)
nocxs = []

# time it
start = datetime.now()

print(f'{datetime.now()-start} beginning subphrase analysis...')

for i, phrase in enumerate(timephrases):
     
    # analyze all known relas
    elements = L.d(phrase,'word')
    
    # analyze with debug exceptions
    try:
        cxs = spc.analyzestretch(elements)
    except Exception:
        sys.stderr.write(f'\nFAIL...running with debug...\n')
        pretty(phrase)
        spc.analyzestretch(elements, debug=True)
        raise Exception('...debug complete...')

    # save those phrases that have no matching constructions
    if not cxs:
        nocxs.append(phrase)
    else:
        phrase2cxs[phrase] = cxs
        
    # report status
    if i % 500 == 0 and i:
        print(f'\t{datetime.now()-start}\tdone with iter {i}/{len(timephrases)}')
        
print(f'{datetime.now()-start}\tCOMPLETE')
print('-'*20)
print(f'{len(phrase2cxs)} phrases matched with Constructions...')
print(f'{len(nocxs)} phrases not yet matched with Constructions...')

0:00:00.000046 beginning analysis...
	0:00:15.457987	done with iter 500/3864
	0:00:32.351404	done with iter 1000/3864
	0:00:50.386257	done with iter 1500/3864
	0:01:05.639771	done with iter 2000/3864
	0:01:24.600140	done with iter 2500/3864
	0:01:44.746979	done with iter 3000/3864
	0:01:59.328218	done with iter 3500/3864
0:02:13.405888	COMPLETE
--------------------
3864 phrases matched with Constructions...
0 phrases not yet matched with Constructions...


## Closing Gaps

### Identify Gaps

Find timephrases that contain words not covered by any construction, other than waw conjunctions.

In [48]:
# gapped = []
# tested = []

# for ph, cxs in phrase2cxs.items():
    
#     tested.append(ph)
    
#     ph_slots = set(
#         s for s in L.d(ph,'word')
#     )
#     cx_slots = set(
#         s for cx in cxs
#             for s in cx.slots
#     )
    
#     if ph_slots.difference(cx_slots):
#         gapped.append(cxs)
        
# print(f'{len(gapped)} gapped phrases logged...')

In [49]:
# for gp in gapped[:25]:
#     for cx in gp:
#         showcx(cx)
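The gap check boils down to a set difference between the slots of a phrase and the slots covered by its constructions. Below is a minimal sketch on plain integers; `uncovered_slots` and its `ignore` parameter are hypothetical names introduced here, with `ignore` standing in for slots such as waw conjunctions that we do not expect to be covered.

```python
def uncovered_slots(phrase_slots, cx_slot_sets, ignore=frozenset()):
    """Return phrase slots that no construction covers, minus ignored ones."""
    covered = set().union(*cx_slot_sets)
    return set(phrase_slots) - covered - set(ignore)


phrase = [1, 2, 3, 4, 5, 6]
cxs = [{1, 2, 3}, {5, 6}]  # slot sets of two toy constructions

print(uncovered_slots(phrase, cxs))              # {4}
print(uncovered_slots(phrase, cxs, ignore={4}))  # set()
```

A phrase counts as gapped only when the returned set is non-empty after the ignored slots are removed.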

## Connecting Constructions

Developing a CXbuilder to connect all constructions in a complete phrase.


### Ambiguity with Coordinate CXs

Considerable ambiguity is present in several coordinate constructions:

**`A B and C`**<br>
Given that A, B, and C are nominal words, is their relationship `A // B // C` or `A+B // C`? In other words: **what is the relationship of two adjacent nominal words given a list?** Is B a descriptor of A or is it an independent element?

**`A of B and C`**<br>
Is it `(A of B) // (C)` or `(A of (B // C))`?

Or even:

**`A of B C and D`**<br>
This pattern combines elements from both ambiguous cases.

### Method

To address these ambiguities we will apply a battery of disambiguation attempts. At the core of these attempts is a [Semantic Vector Space](https://en.wikipedia.org/wiki/Vector_space_model), which is able to quantify the semantic distance between two words based on their contextual uses throughout the Hebrew Bible.

The working hypothesis of this method is
> Words in coordination with each other will be more semantically similar (i.e. the least distance in the vector space) than other candidates in the phrase.

Semantic similarity in a vector space is not the only method used, however. Another aspect of semantic closeness is phrase structure. For instance, whether two candidates share the same phrase type is considered before their semantic similarity.
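The working hypothesis can be sketched as a simple minimum-distance choice. In the toy example below the pattern is `A of B and C` ("days of the king and nights"), and we ask whether C coordinates with A or with B. The distances are invented for illustration, `closest_coordinate` is a hypothetical helper, and the nested-dictionary layout (`dist[word1][word2]`) is an assumption about how such distances might be stored.

```python
# Toy distances, invented for illustration only (smaller = more similar)
toy_dist = {
    'night': {'day': 0.2, 'king': 0.9},
}


def closest_coordinate(target, candidates, dist):
    """Pick the candidate with the smallest distance to the target."""
    return min(candidates, key=lambda cand: dist[target][cand])


# "days (A) of the king (B) and nights (C)": does C pair with A or B?
best = closest_coordinate('night', ['day', 'king'], toy_dist)
print(best)  # day
```

Here "night" lies closer to "day" than to "king", which favours the reading `(A of B) // (C)` over `(A of (B // C))`.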

In [55]:
from cx_analysis.phrase_grammar import Phrases
cxp = Phrases(phrase2cxs, semdist, A)

NameError: name 'phrase2cxs' is not defined

## Tests

In [94]:
# A.show(A.search('''

# timephrase
#     word pdp=subs ls#card|prpe lex#KL/|JWM/ st=a

#     <: word lex=JWM/
# ''')[:10])

In [95]:
# the following phrases contain cases that still
# need to be fixed for the coordinate cx; some should
# actually be done in the previous cx builder at subphrase level

to_fix = [
    1450039, # coord, add adjacent advb cx with JWM
    1450647, # coord, consider prioritizing Levenshtein over size
    
]

### Test Small

In [140]:
testph = phrase2cxs[1449445]
testph

[CX prep_ph (284192, 284193, 284194, 284195, 284196), CX cont (284197,)]

In [141]:
# test = cxp.appo(testph[1])

# showcx(test, conds=False) 

### Stretch Test

In [98]:
testph = phrase2cxs[1446841]

In [100]:
# test = cxp.analyzestretch(
#     testph, 
#     duplicate=True,
#     debug=True)

# for res in test:
#     showcx(res, conds=True)

<hr>

# TOFIX:
* fix apposition - 1447545 (צען מצרים)

# TOTEST: 

1450333 - from apposition to proper name

<hr>

In [78]:
def filt_gaps(cx):
    """Isolate cxs with gaps"""
    timephrase = L.u(next(iter(cx.slots)),'phrase')[0]
    return bool(set(L.d(timephrase,'word')) - cx.slots)
    
def filt(cx):
    """Find specific lexeme"""
    timephrase = L.u(next(iter(cx.slots)),'phrase')[0]
    phrasewords = L.d(timephrase, 'word')
    return (
        {'JWM/', 'LJLH/'}.issubset(F.lex.v(w) for w in phrasewords)
        and len(phrasewords) == 3
    )

In [101]:
# elements = [
#     cx for ph in list(phrase2cxs.values())
#         for cx in ph
# ]

# results = search(
#     elements, 
#     cxp.appo, 
#     pattern='PP',
#     shuffle=False,
#     #select=lambda c: filt(c),
#     extraFeatures='lex st',
#     show=150
# )

## Stretch Tests

Testing across a whole phrase.

In [53]:
# test = cxp.analyzestretch(phrase2cxs[1449168], debug=True)
# for res in test:
#     showcx(res, conds=False)