# Time Phrase Curation

The starting point for this analysis is the set of phrases in the ETCBC's [BHSA dataset](https://github.com/etcbc/bhsa) with a `function` feature value of `Time`. These phrases are technically very similar to phrases marked with a function of `Adju` ("adjunct"), but with a further specification. The `Time` labels in the BHSA are unlikely to be perfect, and there may be some inconsistencies. The purpose of this notebook is to curate the time phrases that are used for all analyses in this project. This involves rigorously querying for anomalies and manually checking the phrases that will be included in the analysis.

The custom features `head` and `nhead` are also crucial for processing various data on these time phrases (see the [heads repository](https://github.com/etcbc/heads)). The `head` feature specifies a semantic phrase head, while the `nhead` feature specifies nominal heads, including those subsumed under a preposition. The difference between the two features is that, for prepositional phrases, `head` links to the preposition, whereas `nhead` links beyond the preposition (and through any chained prepositions) to the non-quantified nominal element that it governs. The benefit of `nhead` is that one can determine the primary semantic element within the phrase without reference to the functional prepositions.

The `head` and `nhead` features are experimental, so all of the identified heads need to be validated before they can be used further in the analysis. This notebook also checks these features for all of the time phrase tokens (tokenized surface forms).

In [1]:
import collections, random, csv
import pandas as pd
from tf.app import use

# load BHSA
A = use('bhsa', hoist=globals(), mod='etcbc/heads/tf', check=True)
A.api.TF.load('''g_cons_utf8 prs_ps prs prs_nu''', add=True)

TF app is up-to-date.
Using annotation/app-bhsa commit d3cf8f0c2ab5d690a0fda14ea31c33da5c5c8483 (=latest)
  in /Users/cody/text-fabric-data/__apps__/bhsa.
No new data release available online.
Using etcbc/bhsa/tf - c rv1.6 (=latest) in /Users/cody/text-fabric-data.
No new data release available online.
Using etcbc/phono/tf - c r1.2 (=latest) in /Users/cody/text-fabric-data.
No new data release available online.
Using etcbc/parallels/tf - c r1.2 (=latest) in /Users/cody/text-fabric-data.
No new data release available online.
Using etcbc/heads/tf - c rv.1.2.1 (=latest) in /Users/cody/text-fabric-data.


  0.00s loading features ...
   |     0.14s B prs                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
  0.15s All additional features loaded - for details use loadLog()


In [2]:
def tokenPhrase(phrasenode):
    '''Tokenizes a phrase with
    dot-separated words.
    input: phrase node number
    output: token string'''
    words = [(F.g_cons_utf8.v(w) if F.lex.v(w) != 'H' else 'ה') for w in L.d(phrasenode, 'word')]
    return '.'.join(words)

def tokenHeads(headslist):
    '''same as tokenPhrase but with list of head word nodes'''
    return '.'.join((F.g_cons_utf8.v(w) if F.lex.v(w) != 'H' else 'ה') for w in headslist)
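
The tokenization can be tried out without loading the BHSA. The following is a minimal sketch, not the notebook's own code: `token_from_words` and the `(lexeme, surface)` tuples are hypothetical stand-ins for the `F.lex` and `F.g_cons_utf8` lookups, the lexeme labels are illustrative, and the empty article surface form is an assumption about how an unwritten article would be stored.

```python
# Toy stand-in for BHSA word records: (lexeme, consonantal surface form).
# These tuples are hypothetical; the notebook reads the same information
# from the F.lex and F.g_cons_utf8 features.
ARTICLE_LEX = 'H'

def token_from_words(words):
    """Join surface forms with dots, writing every article as ה."""
    return '.'.join('ה' if lex == ARTICLE_LEX else surface
                    for lex, surface in words)

# Assumption: the article has no written consonant of its own here,
# so its surface form is an empty string.
toy_phrase = [('B', 'ב'), ('H', ''), ('BQR', 'בקר')]
toy_token = token_from_words(toy_phrase)
print(toy_token)
```

Normalizing the article to a fixed ה keeps tokens comparable whether or not the article is written out in a given phrase.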

## Phrase Tokens and Phrase Heads

This search counts all of the discrete time phrase tokens in Hebrew and gathers data about their heads. This data is exported to a spreadsheet for manual inspection. For every token, a key of its heads is saved into a dictionary and linked to a list of phrase nodes. A token with more than one heads key is suspicious, since identical surface forms should receive identical heads. All other tokens are exported with their heads for inspection.

In [4]:
tp_heads = collections.defaultdict(lambda: collections.defaultdict(list))
tp_nheads = collections.defaultdict(lambda: collections.defaultdict(list))
tp_count = collections.Counter()

tps = A.search('''

phrase function=Time
/with/
    word language=Hebrew
/-/

''', shallow=True)

for tp in tps:
    token = tokenPhrase(tp)
    heads_token = tokenHeads(E.head.t(tp))
    nheads_token = tokenHeads(E.nhead.t(tp))
    
    
    tp_heads[token][heads_token].append(tp)
        
    # only populate nheads with PP phrases, since nhead feature for NP is exactly the same
    if F.typ.v(tp) == 'PP':
        tp_nheads[token][nheads_token].append(tp)
        
    tp_count[token] += 1
    
suspect_heads = [tp for tp in tp_heads if len(tp_heads[tp]) > 1]
suspect_nheads = [tp for tp in tp_nheads if len(tp_nheads[tp]) > 1]

print(f'total phrase tokens 2 head mappings: {len(tp_heads)}')
print(f'total phrase tokens 2 nhead mappings: {len(tp_nheads)}')
print(f'total suspect heads: {len(suspect_heads)}')
print(f'total suspect nheads {len(suspect_nheads)}')

  1.14s 3961 results
total phrase tokens 2 head mappings: 1171
total phrase tokens 2 nhead mappings: 894
total suspect heads: 0
total suspect nheads 0
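
Since the run above reports no suspects, it may help to see what a suspect would look like. The following is a minimal sketch of the same grouping logic on invented data: the node numbers, transliterated tokens, and heads keys are all hypothetical, and `group_heads_by_token` and `find_suspects` are illustrative names rather than functions from this notebook.

```python
import collections

def group_heads_by_token(records):
    """Map each phrase token to {heads_token: [phrase nodes]}."""
    token2heads = collections.defaultdict(lambda: collections.defaultdict(list))
    for node, token, heads_token in records:
        token2heads[token][heads_token].append(node)
    return token2heads

def find_suspects(token2heads):
    """Return tokens whose identical surface form received different heads."""
    return sorted(token for token, heads in token2heads.items() if len(heads) > 1)

# Invented (node, token, heads_token) records, for illustration only.
toy_records = [
    (1, 'B.H.JWM', 'B'),
    (2, 'B.H.JWM', 'B'),
    (3, 'B.H.BQR.B.H.BQR', 'B'),
    (4, 'B.H.BQR.B.H.BQR', 'B.B'),
]

toy_heads = group_heads_by_token(toy_records)
suspects = find_suspects(toy_heads)
print(suspects)
```

Only the second token is flagged, because its two occurrences map to two different heads keys while the first token maps consistently to one.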


**NB**<br>
The initial run of this search found problems in the phrase ב.ה.בקר.ב.ה.בקר. Some cases marked the second part of the phrase as a parallel element, whereas others marked it as either a phrase atom specification relation (`Spec`) or a subphrase adjunct relation (`adj`). This is an inconsistency in the BHSA's tagging. These issues were addressed in the [heads notebook](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb) of the ETCBC heads repository. The phrase in question is now correctly annotated.

### Compile Manual Inspection Spreadsheet

In [5]:
data_header = ['token', '(n)heads_token', 'freq', 'mark', 'note', 'ex_ref', 'ex', 'ex_node', 'ex_verse']

def inspection_rows(token2heads):
    '''Builds one spreadsheet row per token,
    with a randomly chosen example phrase.
    input: dict of token -> heads token -> phrase nodes
    output: sorted list of rows matching data_header'''
    rows = []
    for token, heads in token2heads.items():
        # no suspects were found above, so each token has exactly one heads key
        head = next(iter(heads))
        ex = random.choice(heads[head])
        ref = '{} {}:{}'.format(*T.sectionFromNode(ex))
        verse = T.text(L.u(ex, 'verse')[0])
        rows.append([token, head, tp_count[token], '', '', ref, T.text(ex), ex, verse])
    return sorted(rows)

# build the two tables separately: tp_nheads only holds PP tokens, so zipping
# it with tp_heads would drop tokens and pair unrelated ones
tp_heads_data = inspection_rows(tp_heads)
tp_nheads_data = inspection_rows(tp_nheads)

In [6]:
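# newline='' is what the csv module expects for output files;
# utf-8 keeps the Hebrew text intact regardless of platform defaults
with open('manual_curation/tp_heads.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(data_header)
    writer.writerows(tp_heads_data)
    
with open('manual_curation/tp_nheads.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(data_header)
    writer.writerows(tp_nheads_data)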

## Investigating Texts

In [22]:
show = [(880243,)]

A.show(show, withNodes=True)

In [13]:
print(F.prs_ps.v(354805), F.prs_nu.v(354805))

p1 sg


In [20]:
E.nhead.t(725623)

(131931,)

In [13]:
F.lex.v(412520)

'>XR/'

## Orphaned Time Phrases

This search checks all time phrase surface forms against phrases marked with "Adju" (adjunct) to identify potential candidates for orphaned time phrases.
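
One way such a comparison might be set up is sketched below. This is an assumption about the approach rather than the notebook's actual search: `orphan_candidates` is a hypothetical helper, and the tokens and node numbers are invented. In practice the tokens would come from applying `tokenPhrase` to `Time` and `Adju` phrase nodes.

```python
def orphan_candidates(time_tokens, adju_phrases):
    """Return Adju phrase nodes whose token also occurs as a Time phrase token.

    time_tokens: iterable of token strings seen on Time phrases
    adju_phrases: dict of Adju phrase node -> token string
    """
    time_tokens = set(time_tokens)
    return sorted(node for node, token in adju_phrases.items()
                  if token in time_tokens)

# Invented data, for illustration only.
toy_time_tokens = {'B.H.JWM', 'H.JWM'}
toy_adju = {101: 'B.H.JWM', 102: 'B.H.BJT', 103: 'H.JWM'}

candidates = orphan_candidates(toy_time_tokens, toy_adju)
print(candidates)
```

A shared surface form only makes a phrase a candidate; each match would still need manual inspection before being treated as a time phrase.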