# Time Phrase Curation

The starting point for this analysis is the set of phrases in the ETCBC's [BHSA dataset](https://github.com/etcbc/bhsa) with a `function` feature value of `Time`. These phrases are technically very similar to phrases marked with a function of `Adju` ("adjunct"), except that they carry a further specification. The `Time` phrases in the BHSA are unlikely to be "perfect," and there may be some inconsistencies. The purpose of this notebook is to curate the time phrases that are used for all analyses in this project. This involves rigorously querying for anomalies and manually checking the phrases that will be included in the analysis.

The custom features `head` and `nhead` are also crucial for processing various data on these time phrases (see [heads repository](https://github.com/etcbc/heads)). The `head` feature specifies a semantic phrase head while the `nhead` feature specifies nominal heads, including those subsumed under a preposition. The difference between the two features is that, for prepositional phrases, `head` links to the preposition head, whereas `nhead` links beyond the preposition (and through any chained prepositions) to the non-quantified nominal element that is governed by it. The benefit of `nhead` is that one can determine the primary semantic element within the phrase without referencing the functional prepositions.
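The distinction between `head` and `nhead` can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the words are English placeholders, the part-of-speech labels are invented, and the helper names `toy_head` and `toy_nhead` are hypothetical. The heads repository applies its own, richer rules to real BHSA nodes.

```python
# Toy model only: each word is a (surface, part_of_speech) tuple.
# Real BHSA phrases are node sets; these labels and helpers are hypothetical.
PREP = "prep"

def toy_head(words):
    """Return the first word of the phrase (the preposition in a PP)."""
    return words[0]

def toy_nhead(words):
    """Skip any leading (chained) prepositions and return the next word."""
    for word in words:
        if word[1] != PREP:
            return word
    return None  # the phrase consisted only of prepositions

# A prepositional phrase with two chained prepositions, and a plain noun phrase.
pp = [("from", PREP), ("upon", PREP), ("day", "noun")]
np_ = [("day", "noun"), ("that", "dem")]

print(toy_head(pp), toy_nhead(pp))    # the two differ for a PP
print(toy_head(np_), toy_nhead(np_))  # the two coincide for an NP
```

In this simplified picture the two values differ only for prepositional phrases, which is why the search further down records `nhead` mappings for `PP` phrases alone.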

The `head` and `nhead` features are experimental, and thus all of the identified heads need to be validated in order to be utilized further in the analysis. This notebook will also check the features for all of the time phrase tokens (tokenized surface forms). 

In [1]:
import collections, random, csv
import pandas as pd
from tf.fabric import Fabric
from tf.app import use

# load BHSA
A = use('bhsa', hoist=globals(), mod='etcbc/heads/tf', check=True)
A.api.TF.load('''g_cons_utf8 
                 prs_ps prs prs_nu
                 head obj_prep
                 ''', add=True)

TF app is up-to-date.
Using annotation/app-bhsa commit d3cf8f0c2ab5d690a0fda14ea31c33da5c5c8483 (=latest)
  in /Users/cody/text-fabric-data/__apps__/bhsa.
No new data release available online.
Using etcbc/bhsa/tf - c rv1.6 (=latest) in /Users/cody/text-fabric-data.
No new data release available online.
Using etcbc/phono/tf - c r1.2 (=latest) in /Users/cody/text-fabric-data.
No new data release available online.
Using etcbc/parallels/tf - c r1.2 (=latest) in /Users/cody/text-fabric-data.
No new data release available online.
Using etcbc/heads/tf - c rv.1.3.1 (=latest) in /Users/cody/text-fabric-data.


  0.00s loading features ...
   |     0.11s B prs                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
  0.12s All additional features loaded - for details use loadLog()


In [2]:
def tokenPhrase(phrasenode):
    '''Tokenizes a phrase with
    dot-separated words.
    input: phrase node number
    output: token string'''
    words = [(F.g_cons_utf8.v(w) if F.lex.v(w) != 'H' else 'ה') for w in L.d(phrasenode, 'word')]
    return '.'.join(words)

def tokenHeads(headslist):
    '''same as tokenPhrase but with list of head word nodes'''
    return '.'.join((F.g_cons_utf8.v(w) if F.lex.v(w) != 'H' else 'ה') for w in headslist)

## Phrase Tokens and Phrase Heads

This search counts all of the discrete time phrase tokens in Hebrew and gathers data about their heads. This data is exported to a spreadsheet for manual inspection. For every token, a key of its heads is saved into a dictionary, linked to a list of phrase nodes. Tokens that map to more than one set of heads are suspicious, since identical surface forms should receive identical heads. All other tokens will be exported with their standard heads for inspection.

In [3]:
tp_heads = collections.defaultdict(lambda: collections.defaultdict(list))
tp_nheads = collections.defaultdict(lambda: collections.defaultdict(list))
tp_count = collections.Counter()

tps = A.search('''

phrase function=Time
/with/
    word language=Hebrew
/-/

''', shallow=True)

for tp in tps:
    token = tokenPhrase(tp)
    heads_token = tokenHeads(E.head.t(tp))
    nheads_token = tokenHeads(E.nhead.t(tp))
    
    
    tp_heads[token][heads_token].append(tp)
        
    # only populate nheads with PP phrases, since nhead feature for NP is exactly the same
    if F.typ.v(tp) == 'PP':
        tp_nheads[token][nheads_token].append(tp)
        
    tp_count[token] += 1
    
suspect_heads = [tp for tp in tp_heads if len(tp_heads[tp]) > 1]
suspect_nheads = [tp for tp in tp_nheads if len(tp_nheads[tp]) > 1]

print(f'total phrase tokens 2 head mappings: {len(tp_heads)}')
print(f'total phrase tokens 2 nhead mappings: {len(tp_nheads)}')
print(f'total suspect heads: {len(suspect_heads)}')
print(f'total suspect nheads {len(suspect_nheads)}')

  1.25s 3961 results
total phrase tokens 2 head mappings: 1171
total phrase tokens 2 nhead mappings: 894
total suspect heads: 0
total suspect nheads 0


**NB**<br>
The initial run of this search found problems in the phrase: ב.ה.בקר.ב.ה.בקר. Some cases marked the second part of the phrase as a parallel element, whereas others marked it as either a phrase atom specification relation (`Spec`) or a subphrase adjunct relation (`adj`). This is inconsistent tagging on the BHSA's part. These issues were addressed in the [heads notebook](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb) of the ETCBC heads repository. The phrase in question is now correctly annotated.

### Compile Manual Inspection Spreadsheet

In [4]:
# tp_heads_data = []
# tp_nheads_data = []
# data_header = ['token', '(n)heads_token', 'freq', 'mark', 'note', 'ex_ref', 'ex', 'ex_node', 'ex_verse']

# for htp, nhtp in zip(tp_heads.keys(), tp_nheads.keys()):
#     head = next(tp for tp in tp_heads[htp])
#     nhead = next(tp for tp in tp_nheads[nhtp])
#     head_ex = random.choice(tp_heads[htp][head])
#     nhead_ex = random.choice(tp_nheads[nhtp][nhead])
    
#     head_ref, nhead_ref = ['{} {}:{}'.format(*T.sectionFromNode(ex)) for ex in (head_ex, nhead_ex)]
#     head_txt, nhead_txt = [T.text(ex) for ex in (head_ex, nhead_ex)]
#     head_verse, nhead_verse = [T.text(L.u(ex, 'verse')[0]) for ex in (head_ex, nhead_ex)]
    
#     heads_data = [htp, head, tp_count[htp], '', '', head_ref, head_txt, head_ex, head_verse]
#     nheads_data = [nhtp, nhead, tp_count[nhtp], '', '', nhead_ref, nhead_txt, nhead_ex, nhead_verse]
#     tp_heads_data.append(heads_data)
#     tp_nheads_data.append(nheads_data)
    
# tp_heads_data, tp_nheads_data = sorted(tp_heads_data), sorted(tp_nheads_data)

In [5]:
# with open('manual_curation/tp_heads.csv', 'w') as outfile:
#     writer = csv.writer(outfile)
#     writer.writerow(data_header)
#     writer.writerows(tp_heads_data)
    
# with open('manual_curation/tp_nheads.csv', 'w') as outfile:
#     writer = csv.writer(outfile)
#     writer.writerow(data_header)
#     writer.writerows(tp_nheads_data)

# Produce Report on Manual Annotations

The manual annotations are intended to serve three roles: (1) to evaluate the accuracy of the head assignments on time phrases in the BHSA data, (2) to evaluate the time phrase structure in the BHSA dataset, and (3) to gain hands-on exposure to the kinds of time phrases in the dataset. This process consisted of comparing the selected heads against the surface text of the time phrase, and of reading the time phrases in the context of a verse when questions or anomalies arose. Each time phrase was marked "g" for "good," "b" for "bad," or "?" for questionable cases. These classifications refer to both head assignments and internal structuring of the time phrases in the BHSA. The markings are often accompanied by notes. For bad or questionable entries, the note explains what is wrong. For good entries, the note might describe an interesting phenomenon or, in some cases, give a "light caution" about a given phrase.
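A quick way to summarize such a marked spreadsheet is to tally the marks and list the entries that still need attention. The following is a minimal sketch using only the standard library. The rows and token names are invented for illustration; only the column names `token`, `mark`, and `note` follow the export header used in this notebook.

```python
import collections
import csv
import io

# Invented rows for illustration; only the column names come from the notebook.
sample_csv = io.StringIO(
    "token,mark,note\n"
    "tok_a,g,\n"
    "tok_b,b,wrong head selected\n"
    "tok_c,?,phrase boundary unclear\n"
    "tok_d,g,light caution\n"
)

rows = list(csv.DictReader(sample_csv))

# Count how often each mark occurs.
mark_counts = collections.Counter(row["mark"] for row in rows)

# Collect the tokens marked bad or questionable for follow-up.
needs_review = [row["token"] for row in rows if row["mark"] in {"b", "?"}]

print(mark_counts)
print(needs_review)
```

The same tally could be produced with pandas on the annotated CSV files. The standard-library version is shown only to keep the sketch self-contained.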

The annotations suggest that a custom database is necessary to represent time phrases consistently: there are many cases in BHSA time phrases where the phrase is cleft into 2 adjacent parts, whereas in the majority of the data the parts are kept together. This is an inconsistency that should be resolved. In other cases, the notion of "phrase" is not broad enough to encompass the full range of expressions that can mark time. For instance, several time phrases are split off from the infinitives they direct, where the infinitive denotes an event. This is because the ETCBC's strict structuralist methodology has defined the infinitive event as a clause, operating at a different hierarchical level than the phrase; phrases are defined as strictly non-predicative. In the framework of Construction Grammar (Goldberg, *Constructions*, 1995; Croft, *Radical Construction Grammar*, 2001) preferred by this study, these divisions are not necessary, and in fact hinder an accurate and comprehensive description. In Construction Grammar, no division is assumed between syntax and semantics, and thus the difference between a clause and a phrase is merely a difference in degree based on the two constructions' forms and meanings, not a fundamental difference in kind. Indeed, many phrases in this dataset refer to event-like nouns, which are non-predicative from a syntactic perspective; במות "in the death of" is a very frequent example, but other cases include ביום ישׁועה "in the day of salvation," which assumes a salvation event. If the difference between time phrases and time "clauses" is seen as incremental rather than categorical, then one can apply a unified strategy in analyzing these cases.

In [6]:
pd.set_option('display.max_colwidth', 0)  # configure DataFrame to show full notes with no truncation

In [7]:
head_anno = pd.read_csv('manual_curation/tp_heads_annotated.csv')
nhead_anno = pd.read_csv('manual_curation/tp_nheads_annotated.csv')

## Looking at Bad Cases

In [8]:
#head_anno[head_anno['mark'] == 'b'].to_csv('manual_curation/head_fixes.csv')

In [9]:
#nhead_anno[nhead_anno['mark']=='b'].to_csv('manual_curation/nhead_fixes.csv')

**All situations have been remedied. The chosen solution is stored under the new column, "fix", in `head_fixes.csv` and `nhead_fixes.csv`.**

## Looking at Questionable Cases

In [14]:
#head_anno[head_anno['mark'] == '?']

In [15]:
#nhead_anno[nhead_anno['mark']=='?']

# Custom Features and Objects

A number of edits to BHSA are needed for this project. These include corrections to existing data as well as the creation of new objects for the sake of analysis. This is done in the subsequent sections. A new node feature, `note`, is also introduced. `note` is a string feature that will be used to tag these edits and changes, as well as to add commentary and discussion to time objects. This will allow me to track specific situations and issues as they arise throughout my project, and to refer back to them later.

In [12]:
nodeFeatures=collections.defaultdict(lambda:collections.defaultdict())
edgeFeatures=collections.defaultdict(lambda:collections.defaultdict())

## Remap Functions

In [13]:
nodeFeatures['function'] = dict((ph, F.function.v(ph)) for ph in F.otype.s('phrase'))

newfunctions = {849296:'Loca',
                825329:'Loca',
                828081:'Cmpl',
                774349:'Adju',
                774352:'Adju',
                775948:'Adju',
                775985:'Adju',
                876172:'Adju',}

nodeFeatures['function'].update(newfunctions)

for ph, funct in newfunctions.items():
    # record both the original and the new function value in the note
    nodeFeatures['note'][ph] = f'function remapped from {F.function.v(ph)} to {funct}'

In [14]:
A.pretty(L.u(846434, 'sentence')[0])

## Merge Phrases (`phrase2`)

Some phrases are unnecessarily split into two. This is fixed by creating a new object, `phrase2`. Later on, another object will be generated, `cx` (construction), which will contain any mixture of single words, phrases, and sentences, essentially ignoring the old, strict divisions.

In [15]:
# recreate ETCBC otype and oslots files
# these will be appended to rather than altered
nodeFeatures['otype'] = dict((n, F.otype.v(n)) for n in N())
edgeFeatures['oslots'] = dict((n, L.d(n, 'word')) for n in N() if F.otype.v(n) != 'word')

### Merge Adjacent TPs

There are several cases in BHSA where time phrases are divided up into 2, 3, or even 4 pieces, whereas elsewhere the parts are kept together as a single phrase. This is an undesirable inconsistency. To solve this problem, new phrase boundaries are generated and mapped over the old boundaries stored in the oslots file. For all of the cases that are remapped, a print-out confirms the new slots.
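The merging idea can be sketched independently of Text-Fabric. In the toy version below, a clause is a list of `(function, slots)` tuples in text order. The function name `merge_adjacent_time` and the sample data are hypothetical, and the sketch assumes that "adjacent" means the second phrase starts on the slot immediately after the first one ends.

```python
def merge_adjacent_time(phrases):
    """Collapse runs of slot-adjacent 'Time' phrases into single phrases.

    phrases: list of (function, slots) tuples from one hypothetical clause,
    given in text order. Returns a new list; the input is not modified.
    """
    merged = []
    for function, slots in phrases:
        prev = merged[-1] if merged else None
        if (function == "Time" and prev is not None and prev[0] == "Time"
                and prev[1][-1] + 1 == slots[0]):
            # extend the preceding Time phrase instead of starting a new one
            prev[1].extend(slots)
        else:
            merged.append((function, list(slots)))
    return merged

# Two adjacent Time phrases, a predicate, and a separate Time phrase.
clause = [
    ("Time", [1, 2, 3]),
    ("Time", [4, 5]),
    ("Pred", [6]),
    ("Time", [7, 8]),
]
print(merge_adjacent_time(clause))
```

Only the first two phrases are merged here, because the last `Time` phrase is separated from them by the predicate. The notebook's own cell achieves the same effect with search templates and by skipping the non-initial phrases.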

In [16]:
first_tp = set(res[1] for res in A.search('''

% find all cases of time phrases followed by, 
% but not preceded by, another time phrase

clause
    phrase function=Time
    /without/
    clause
        phrase function=Time
        <: ..
    /-/
    <: phrase function=Time
''', silent=True))

delete_tp = set(res[2] for res in A.search('''

% find all cases of time phrases preceded by
% another time phrase for deletion

clause
    phrase function=Time
    <: phrase function=Time

''', silent=True))

oldmaxotype = max(nodeFeatures['otype'].keys())
maxotype = oldmaxotype+1
new_phrase = set()

for phrase in F.otype.s('phrase'):
    
    if phrase not in first_tp|delete_tp:
        edgeFeatures['oslots'][maxotype] = L.d(phrase, 'word')
        nodeFeatures['otype'][maxotype] = 'phrase2'
        nodeFeatures['function'][maxotype] = F.function.v(phrase)
        maxotype += 1
    
    elif phrase in first_tp: 
        new_slots = list(L.d(phrase, 'word')) # compile new slots here
        this_phrase, this_clause = phrase, L.u(phrase, 'clause')[0] # this_phrase iterates +1 each loop, this_clause does not
        
        # gather all slots in subsequent time phrases
        while (F.function.v(this_phrase+1) == 'Time')\
            and this_phrase+1 in L.d(this_clause, 'phrase'): # subsequent TP must also be in same clause
            new_slots.extend(L.d(this_phrase+1, 'word'))
            this_phrase = this_phrase+1

        edgeFeatures['oslots'][maxotype] = new_slots
        nodeFeatures['otype'][maxotype] = 'phrase2'
        nodeFeatures['note'][maxotype] = 'new phrase by phrase merge'
        nodeFeatures['function'][maxotype] = 'Time'
        new_phrase.add(maxotype)
        maxotype+=1
        
    elif phrase in delete_tp: # delete by skipping
        print(f"skipping over tp {phrase} {T.text(phrase)}")
        print(f"\tin {T.text(L.u(phrase,'clause')[0])}")
        print()
        continue

skipping over tp 653402 בַּחֹ֨דֶשׁ֙ הַשֵּׁנִ֔י 
	in בִּשְׁנַ֨ת שֵׁשׁ־מֵאֹ֤ות שָׁנָה֙ לְחַיֵּי־נֹ֔חַ בַּחֹ֨דֶשׁ֙ הַשֵּׁנִ֔י בְּשִׁבְעָֽה־עָשָׂ֥ר יֹ֖ום לַחֹ֑דֶשׁ בַּיֹּ֣ום הַזֶּ֗ה נִבְקְעוּ֙ כָּֽל־מַעְיְנֹת֙ תְּהֹ֣ום רַבָּ֔ה 

skipping over tp 653403 בְּשִׁבְעָֽה־עָשָׂ֥ר יֹ֖ום לַחֹ֑דֶשׁ 
	in בִּשְׁנַ֨ת שֵׁשׁ־מֵאֹ֤ות שָׁנָה֙ לְחַיֵּי־נֹ֔חַ בַּחֹ֨דֶשׁ֙ הַשֵּׁנִ֔י בְּשִׁבְעָֽה־עָשָׂ֥ר יֹ֖ום לַחֹ֑דֶשׁ בַּיֹּ֣ום הַזֶּ֗ה נִבְקְעוּ֙ כָּֽל־מַעְיְנֹת֙ תְּהֹ֣ום רַבָּ֔ה 

skipping over tp 653404 בַּיֹּ֣ום הַזֶּ֗ה 
	in בִּשְׁנַ֨ת שֵׁשׁ־מֵאֹ֤ות שָׁנָה֙ לְחַיֵּי־נֹ֔חַ בַּחֹ֨דֶשׁ֙ הַשֵּׁנִ֔י בְּשִׁבְעָֽה־עָשָׂ֥ר יֹ֖ום לַחֹ֑דֶשׁ בַּיֹּ֣ום הַזֶּ֗ה נִבְקְעוּ֙ כָּֽל־מַעְיְנֹת֙ תְּהֹ֣ום רַבָּ֔ה 

skipping over tp 653558 בְּשִׁבְעָה־עָשָׂ֥ר יֹ֖ום לַחֹ֑דֶשׁ 
	in וַתָּ֤נַח הַתֵּבָה֙ בַּחֹ֣דֶשׁ הַשְּׁבִיעִ֔י בְּשִׁבְעָה־עָשָׂ֥ר יֹ֖ום לַחֹ֑דֶשׁ עַ֖ל הָרֵ֥י אֲרָרָֽט׃ 

skipping over tp 653566 בְּאֶחָ֣ד לַחֹ֔דֶשׁ 
	in בָּֽעֲשִׂירִי֙ בְּאֶחָ֣ד לַחֹ֔דֶשׁ נִרְא֖וּ רָאשֵׁ֥י הֶֽהָרִֽים׃ 

skipping over tp 653660 בָּֽרִ

In [17]:
print('new phrases: ')
for np in sorted(new_phrase):
    print(np, T.text(edgeFeatures['oslots'][np]))

new phrases: 
1448659 בִּשְׁנַ֨ת שֵׁשׁ־מֵאֹ֤ות שָׁנָה֙ לְחַיֵּי־נֹ֔חַ בַּחֹ֨דֶשׁ֙ הַשֵּׁנִ֔י בְּשִׁבְעָֽה־עָשָׂ֥ר יֹ֖ום לַחֹ֑דֶשׁ בַּיֹּ֣ום הַזֶּ֗ה 
1448812 בַּחֹ֣דֶשׁ הַשְּׁבִיעִ֔י בְּשִׁבְעָה־עָשָׂ֥ר יֹ֖ום לַחֹ֑דֶשׁ 
1448819 בָּֽעֲשִׂירִי֙ בְּאֶחָ֣ד לַחֹ֔דֶשׁ 
1448912 בְּאַחַ֨ת וְשֵׁשׁ־מֵאֹ֜ות שָׁנָ֗ה בָּֽרִאשֹׁון֙ בְּאֶחָ֣ד לַחֹ֔דֶשׁ 
1448927 בַחֹ֨דֶשׁ֙ הַשֵּׁנִ֔י בְּשִׁבְעָ֧ה וְעֶשְׂרִ֛ים יֹ֖ום לַחֹ֑דֶשׁ 
1449273 אַחַ֣ר הַמַּבּ֑וּל שְׁלֹ֤שׁ מֵאֹות֙ שָׁנָ֔ה וַֽחֲמִשִּׁ֖ים שָׁנָֽה׃ 
1464032 מֵעֹודִ֖י עַד־הַיֹּ֥ום הַזֶּֽה׃ 
1465793 גַּ֤ם מִתְּמֹול֙ גַּ֣ם מִשִּׁלְשֹׁ֔ם גַּ֛ם מֵאָ֥ז 
1468421 עַ֣ד אַרְבָּעָ֥ה עָשָׂ֛ר יֹ֖ום לַחֹ֣דֶשׁ הַזֶּ֑ה 
1468583 בְּאַרְבָּעָה֩ עָשָׂ֨ר יֹ֤ום לַחֹ֨דֶשׁ֙ 
1468587 עַ֠ד יֹ֣ום הָאֶחָ֧ד וְעֶשְׂרִ֛ים לַחֹ֖דֶשׁ 
1477660 בְּיֹום־הַחֹ֥דֶשׁ הָרִאשֹׁ֖ון בְּאֶחָ֣ד לַחֹ֑דֶשׁ 
1477796 בַּחֹ֧דֶשׁ הָרִאשֹׁ֛ון בַּשָּׁנָ֥ה הַשֵּׁנִ֖ית בְּאֶחָ֣ד לַחֹ֑דֶשׁ 
1479324 כָּל־הַלַּ֨יְלָה֙ עַד־הַבֹּ֔קֶר 
1480264 שִׁבְעַ֣ת יָמִ֔ים עַ֚ד יֹ֣ום 
1480282 יֹומָ֤ם וָלַ֨יְלָה֙ שִׁבְעַ֣ת

Check for gaps. Every slot should be inside a `phrase2`.

In [18]:
coveredslots = sorted(slot for ph, slots in edgeFeatures['oslots'].items()
                          for slot in slots
                          if ph > oldmaxotype)

for i, slot in enumerate(coveredslots):
    if i+2 > len(coveredslots):
        print('no problems found!')
        break
    if slot+1 != coveredslots[i+1]:
        raise Exception(f'{slot}, {coveredslots[i+1]}')

no problems found!
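The same contiguity check can be written as a small standalone helper that reports every break instead of stopping at the first one. This is a sketch with a hypothetical function name, `find_gaps`, and invented slot lists. Like the loop above, it also flags a slot that appears twice, since a repeated slot is not followed by its successor.

```python
def find_gaps(slots):
    """Return (previous, current) pairs where the sorted slots are not consecutive."""
    ordered = sorted(slots)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b != a + 1]

print(find_gaps([1, 2, 3, 4]))     # fully covered, no breaks reported
print(find_gaps([1, 2, 4, 5, 5]))  # one missing slot and one repeated slot
```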


## Complex Cases

Some cases will require further research. These are marked with a simple note: "complex".

In [19]:
nodeFeatures['note'][846434] = 'complex'

## Export New Object Data to TF

In [20]:
featuredir = '../../data'

phrasefunctions = dict((ph, F.function.v(ph)) for ph in F.otype.s('phrase'))

meta = {'':{'created_by': 'Cody Kingham, Dirk Roorda, and Constantijn Sikkel (ETCBC)',
            'coreData':'BHSA',
            'coreVersion':'c',},
        'oslots': {'valueType':'int', 
                   'edgeValues':False},
        'otype':{'valueType':'str'},
        'note':{'valueType':'str',
                'description':'notes on objects for tracking issues throughout my research'},
        'function': {'valueType':'str', 
                     'description':'This is a corrected version of original BHSA phrase functions. Information about the edits can be found by finding an F.note feature on a phrase that mentions the remapping'}
       }

TFsave = Fabric(locations=featuredir)

TFsave.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=meta)

This is Text-Fabric 7.4.11
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

4 features found and 0 ignored
  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Exporting 3 node and 1 edge and 0 config features to /Users/cody/github/csl/time_collocations/analysis/../data:
  0.00s VALIDATING oslots feature
  0.13s maxSlot=     426584
  0.14s maxNode=    1699934
  0.39s OK: oslots is valid
   |     0.71s T function             to /Users/cody/github/csl/time_collocations/analysis/../data
   |     0.00s T note                 to /Users/cody/github/csl/time_collocations/analysis/../data
   |     0.72s T otype                to /Users/cody/github/csl/time_collocations/analysis/../data
   |     3.40s T oslots               to /Users/cody/github/csl/time_collocations/analysis/../data
  5.23s Exported 3 node features and 1 edge features and 0 config features to /Users/cody/github/csl/time_collocations/analysis/../data


True

## Exploring New Objects

In [25]:
locations = ['/Users/cody/text-fabric-data/etcbc/bhsa/tf/c/',
             '../../data/',]

TF = Fabric(locations=locations)
api = TF.load('''

vs vt pdp gloss lex language rela typ number
function

''')

B = use('bhsa', api=api)

This is Text-Fabric 7.4.11
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

115 features found and 3 ignored
  0.00s loading features ...
   |     0.11s B lex                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.13s B vs                   from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.13s B vt                   from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.13s B pdp                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.15s B gloss                from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.13s B language             from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.23s B rela                 from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.22s B typ                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.24s B number               from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.17s B function             from /Users

In [31]:
B.show(B.search('''

book book@en=2_Chronicles
    chapter chapter=21
        verse verse=19
            phrase2 function=Time
                word

'''), condenseType='clause', condensed=True)

  0.69s 6 results
