# Classifying BH Time Constructions

There are three basic constructions:

* non-prepositional time constructions
* prepositional time constructions
* multiphrasal time constructions

## Load Modules and Data

In [1]:
# Text-Fabric processor and tools
from tf.fabric import Fabric
from tf.app import use
from tools.locations import data_locations

# stats & data-containers
import collections, random, csv, re
import pandas as pd
import numpy as np
import scipy.stats as stats
from tools.significance import contingency_table, apply_fishers
from tools.pca import plot_PCA
from tools.helpers import convert2pandas
from sklearn.decomposition import PCA

# data visualizations
from tools.visualize import reverse_hb, barplot_counts
import seaborn as sns
sns.set(font_scale=1.5, style='whitegrid')
import matplotlib.pyplot as plt
from IPython.display import display, HTML

# load custom BHSA data + heads
TF = Fabric(locations=data_locations.values())
load_features = ['g_cons_utf8', 'trailer_utf8', 'label', 'lex',
                 'role', 'rela', 'typ', 'function', 'language',
                 'pdp', 'gloss', 'vs', 'vt', 'nhead', 'head', 
                 'mother', 'nu', 'prs', 'sem_set', 'ls', 'st',
                 'kind', 'top_assoc', 'number']
api = TF.load(' '.join(load_features))
F, E, T, L = api.F, api.E, api.T, api.L # shortform TF methods

# configure Hebrew display
A = use('bhsa', api=api, silent=True)
A.displaySetup(condenseType='clause', withNodes=True)

# import TF-dependent tools
from tools.tokenize import tokenize_surface
from tools.time import Time

This is Text-Fabric 7.8.4
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

123 features found and 4 ignored
  0.00s loading features ...
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
  7.15s All features loaded/computed - for details use loadLog()


# Basic Exploration

The analysis looks at chunk objects with `label=timephrase`. Aramaic time constructions are excluded. Below we print the total number of such objects.

In [2]:
times = A.search('''

chunk label=timephrase

''', shallow=True)

  0.08s 3881 results


In [3]:
surfaces = collections.Counter()

for cx in times:
    surface_token = tokenize_surface(cx)
    surfaces[surface_token] += 1
    
surfaces = convert2pandas(surfaces)

In [4]:
print(f'{surfaces.shape[0]} unique surface forms found')

1167 unique surface forms found


In [5]:
surfaces.head(50)

Unnamed: 0,Total
עתה,342
ב.ה.יום.ה.הוא,203
ה.יום,191
ל.עולם,85
ב.ה.בקר,78
עד.ה.יום.ה.זה,71
ב.יום,69
אז,66
שׁבעת.ימים,63
עד.עולם,53


This top list accounts for a substantial proportion of all known time adverbials in the dataset:

In [6]:
surfaces.head(50).sum()[0] / surfaces.sum()[0]

0.545735635145581

The top 50 of the 1167 surface forms account for about 55% of all time constructions in the sample, which suggests that this surface count table contains most of the key constructional elements for a TIME taxonomy.
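To see how quickly coverage saturates as more forms are included, the share covered by the top N forms can be computed for several values of N. The following is a minimal sketch on a toy `Counter`; `top_n_coverage` is a hypothetical helper (not part of the notebook's `tools` package), and the toy counts are invented for illustration, not taken from the BHSA data.

```python
import collections

def top_n_coverage(counts, n):
    """Share of all tokens covered by the n most frequent forms."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total

# toy counts, invented for illustration (not the BHSA figures)
toy = collections.Counter({'a': 50, 'b': 30, 'c': 10, 'd': 5, 'e': 5})

for n in (1, 2, 5):
    print(n, round(top_n_coverage(toy, n), 2))
```

Applied to the raw surface counter (before it is converted to a DataFrame), the same function would show whether the top 50 forms sit at a natural plateau or whether the cutoff is arbitrary.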

## Distinguish Single Phrase and Multi-Phrasal Time Constructions

### Single Phrase Constructions

In [7]:
singles = []
multi = []

for tc in F.label.s('timephrase'):
    
    nphrases = L.d(tc, 'phrase')
    nphraseatoms = L.d(tc, 'phrase_atom')
    times = [r for r in E.role.t(tc) if r[1]=='time']

    if all([len(nphrases) == 1, 
            len(nphraseatoms) == 1,
            len(times) == 1]):
        singles.append(tc)
        
    else:
        multi.append(tc)
        
print(f'{len(singles)} single phrasal constructions found...')
print(f'{len(multi)} multi phrasal constructions found...')

3352 single phrasal constructions found...
529 multi phrasal constructions found...


I will manually inspect randomly selected single-phrase cases. Cases in which the time construction consists of only a single word are excluded.

In [8]:
# inspect = [r for r in singles if len(L.d(r, 'word')) > 1]
# print(len(inspect), 'cases ready for manual inspection...')
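
For manual inspection it can help to draw a fixed-size random sample that is reproducible between sessions, rather than shuffling a list in place. The sketch below assumes a hypothetical helper `sample_for_inspection` and uses made-up node numbers as stand-ins for construction nodes; it is one possible approach, not the procedure used in this notebook.

```python
import random

def sample_for_inspection(items, k, seed=42):
    """Return up to k items drawn without replacement, reproducibly."""
    rng = random.Random(seed)  # local RNG; leaves the global random state untouched
    items = list(items)
    return rng.sample(items, min(k, len(items)))

# hypothetical node numbers standing in for construction nodes
candidates = list(range(1000, 1100))
batch = sample_for_inspection(candidates, 10)
print(batch)
```

Because the seed is fixed, the same batch can be revisited later, which makes notes taken during inspection easier to trace back to specific cases.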

In [9]:
# random.shuffle(multi)

In [10]:
# A.show(multi)

# Classifications

The classifications are added to a dictionary, keyed by name, to keep track of which constructions are accounted for. The process is one of elimination and deduction based on features identified in an inductive analysis.

In [11]:
classes = {}
total_cxs = len(singles) + len(multi)

In [12]:
def percent(n, total):
    return round(n/total, 2)

def prog():
    '''Reports the proportion of constructions classified so far'''
    found = set(cx for cls, clset in classes.items() for cx in clset)
    ratio = percent(len(found), total_cxs)
    print(f'\t{ratio} now accounted for')

class CXdata:
    '''
    Makes count and result data available on
    a given construction class. The class is
    identified through a supplied search function.
    '''
    def __init__(self, cx_set, validate, tokenize):
        
        self.timeset = set()
        counts = collections.Counter()
        self.results = collections.defaultdict(list)
    
        for cx in cx_set:
            time = Time(cx)
            if validate(time):
                count_text = tokenize(time)
                counts[count_text] += 1
                self.timeset.add(cx)
                self.results[count_text].append([cx]+list(L.d(cx, 'word')))
                
        self.counts = convert2pandas(counts)
        
        print(f'{len(self.timeset)} matching constructions found...')
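
Since each class is stored as a set of construction nodes, it is straightforward to check whether any construction has been assigned to more than one class. The sketch below is an illustration on toy data; `find_overlaps`, the class names, and the node numbers are all hypothetical. Note that `prog()` counts the union of all classes, so overlaps would not inflate the reported ratio, but they would mean that the per-class counts do not sum to the total.

```python
from itertools import combinations

def find_overlaps(classes):
    """Map each pair of class names to the members they share, if any."""
    overlaps = {}
    for (name_a, set_a), (name_b, set_b) in combinations(classes.items(), 2):
        shared = set_a & set_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

# toy class dictionary with hypothetical names and node numbers
toy_classes = {
    'class_a': {1, 2, 3},
    'class_b': {3, 4},
    'class_c': {5},
}
print(find_overlaps(toy_classes))
```

Whether an overlap is an error or a legitimate case of dual membership is an analytical decision that this check does not make.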

## Single Phrase, Non-Prepositional

### The Non-Prepositional (ø) "Adverb Construction"


The adverb construction is a single-word construction that is anchored to speech time. The lack of any specifiers, such as demonstratives, definite articles, or other modifiers, means that these words must be anchored to an implied common point. This gives rise to what can be considered an "adverb construction." Is this phenomenon discussed in the literature? Do other scholars recognize the association between modifiers and anchoredness?

These constructions are selected simply by finding those with only one specification: `time`.

In [13]:
def check_advb(time):
    '''Checks for adverb status'''
    if time.tag == 'time':        
        return True

def token_advb(time):
    '''Tokenizes adverb cx'''
    text = tokenize_surface(time.cx)
    return text

np_adverbs = CXdata(singles, check_advb, token_advb)

classes['ø_adverb'] = np_adverbs.timeset

prog()

606 matching constructions found...
	0.16 now accounted for


In [14]:
np_adverbs.counts

Unnamed: 0,Total
עתה,342
אז,66
לילה,41
אחר,34
מחר,31
תמיד,30
יומם,20
עולם,6
טרם,4
בקר,4


In [15]:
A.show(np_adverbs.results['חדשׁ'], condenseType='sentence')

**TODO**: NB the example above. Is it indeed an adverbial construction, or a simple use of the durative?

#### [Research and defense of this construction here]

### The ø "Attributive Anchor" Construction, [H + time + H + anchor]

This construction takes advantage of the attributive noun construction. It frequently occurs with the demonstrative, in which case the demonstrative serves to anchor the time word.


In [16]:
def check_atanchor(time):
    '''Validates attributive anchor construction'''
    if 'attr_patt' in time.specs:
        return True
    
def token_atanchor(time):
    '''Tokenize attributive anchor cx'''
    thetime = time.times[0]
    attrib = thetime + 2
    return F.g_cons_utf8.v(attrib)

attimes = CXdata(singles, check_atanchor, token_atanchor)

classes['ø_attrib_anchor'] = attimes.timeset

prog()

639 matching constructions found...
	0.32 now accounted for


In [17]:
attimes.counts

Unnamed: 0,Total
הוא,246
זה,138
היא,44
שׁביעי,40
הם,31
שׁלישׁי,25
אלה,15
שׁני,14
שׁמיני,14
ראשׁון,11


In [18]:
A.show(attimes.results['רבים'])

#### [Research and defense of this construction here]

### The "Quantified Durative" Construction

In [19]:
def qd_validate(time):
    '''
    Validates a quantified durative cx
    Selects quants but excludes
        • qualitatives
        • the number "one"
    '''

    quant_lexs = set(F.lex.v(q) for q in time.quants) # to check for אחד
    if all(['time' in time.specs, 
            'quant' in time.specs,  
            'card' in time.specs,
            quant_lexs - {'>XD/'}]):    
        return True
    
def token_qd(time):
    '''Tokenizes quantitative durative cxs'''
    time_text = F.lex_utf8.v(time.times[0])
    return time_text
    
qdtimes = CXdata(singles, qd_validate, token_qd)
        
classes['quant_durative'] = qdtimes.timeset
        
prog()

288 matching constructions found...
	0.4 now accounted for


In [20]:
qdtimes.counts

Unnamed: 0,Total
שׁנה,147
יום,120
חדשׁ,18
שׁבוע,2
ירח,1


In [21]:
A.show(qdtimes.results['שׁבוע'])

#### [Research and defense of this construction here]

### The Case of Qualitative Quantifiers



In [22]:
def val_quals(time):
    '''Validates qualitative quantifier constructions'''
    if all(['time' in time.specs, 
            'qual' in time.specs, 
            'H' not in time.specs]):
        return True
    
def tok_qual(time):
    '''Tokenizes qual quants with the quantifier'''
    return F.lex_utf8.v(time.quants[0])

qual_quants = CXdata(singles, val_quals, tok_qual)

classes['quant_durative'] |= qual_quants.timeset

prog()

88 matching constructions found...
	0.42 now accounted for


In [23]:
qual_quants.counts

Unnamed: 0,Total
כל,63
רב,19
מספר,3
חצות,2
יתר,1


Randomized inspection of results below...

In [24]:
# random.shuffle(qual_quants.results['כל'])

# A.show(qual_quants.results['כל'])

The cases of חצות and יתר are interesting. Do they bear the same semantics as the others?

In [25]:
A.show(qual_quants.results['חצות'])

יתר seems to align more straightforwardly with the quantified durative expression:

In [26]:
A.show(qual_quants.results['יתר'])

### The Cases of qualQuant + H + time 

In [27]:
def val_qualHquant(time):
    '''Validates qualitative quantifiers with H + time'''
    if all(['time' in time.specs,
            'qual' in time.specs,
            'H' in time.specs
           ]):
        return True
    
def token_qhquant(time):
    '''Tokenizes a QualH quant'''
    quant = time.quants[0]
    nxt_word = quant + 1
    return f'{F.lex_utf8.v(quant)}.{F.lex_utf8.v(nxt_word)}'

qualH_quants = CXdata(singles, val_qualHquant, token_qhquant)

classes['quant_durative'] |= (qualH_quants.timeset)

prog()

102 matching constructions found...
	0.44 now accounted for


In [28]:
qualH_quants.counts

Unnamed: 0,Total
כל.ה,100
מספר.ה,1
חצי.ה,1


In [29]:
#A.show(qualH_quants.results['כל.ה'])

These cases do, on initial inspection, seem to be durative as well.

#### [Research and defense of this construction here]

### Cases Where the Quantifier is אחד

In [30]:
def qd_validate_ONE(time):
    '''
    Validates a quantified durative cx
    WITH אחד
    '''

    quant_lexs = set(F.lex.v(q) for q in time.quants) # to check for אחד
    if all(['time' in time.specs, 
            'quant' in time.specs,  
            'card' in time.specs,
            '>XD/' in quant_lexs,
            not quant_lexs - {'>XD/'}
           ]):    
        return True
    
# token_qd, from above, will be used

qd_ones = CXdata(singles, qd_validate_ONE, token_qd)

classes['quant_durative_ones'] = qd_ones.timeset

prog()

12 matching constructions found...
	0.45 now accounted for
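
The two cardinal validators divide cases by a set difference on the quantifier lexemes: one accepts any set containing something other than `>XD/`, the other accepts only sets consisting of `>XD/` alone. The sketch below isolates that logic on plain Python sets. The function names `only_one` and `has_other` are hypothetical, and `LEX_A` and `LEX_B` are placeholders rather than real BHSA lexemes.

```python
ONE = '>XD/'  # lexeme string for אחד, as used in the validators above

def only_one(quant_lexs):
    """True if the only quantifier lexeme present is ONE."""
    return ONE in quant_lexs and not (quant_lexs - {ONE})

def has_other(quant_lexs):
    """True if at least one quantifier lexeme other than ONE is present."""
    return bool(quant_lexs - {ONE})

# illustrative lexeme sets; 'LEX_A' and 'LEX_B' are placeholders
for lexs in [{ONE}, {'LEX_A'}, {ONE, 'LEX_A'}, {'LEX_A', 'LEX_B'}, set()]:
    print(sorted(lexs), only_one(lexs), has_other(lexs))
```

Two consequences follow from this logic: a mixed set that contains `>XD/` alongside another numeral counts as an ordinary quantified durative, and an empty quantifier set is matched by neither test.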


In [31]:
qd_ones.counts

Unnamed: 0,Total
יום,7
שׁנה,3
רגע,1
שׁבוע,1


In [32]:
A.show(qd_ones.results['יום'])

These cases are ambiguous and need to be more thoroughly researched...

**One potential distinguishing marker is whether the time noun is construed / profiled as a "large" time or a "short" time. For instance, רגע is more evidently punctiliar:**

In [33]:
A.show(qd_ones.results['רגע'])

On the other hand, שׁנה, a longer period of time, seems uniformly durative:

In [34]:
A.show(qd_ones.results['שׁנה'])

The intuition that the construed time size affects the interpretation needs to be further explored.

#### [Research and defense of this construction here]

### The Demonstrative Heh Construction

This construction is unique, since the definite article heh is construed as a demonstrative. This is not a typical function of the definite article in noun phrases. It uniquely takes on this role in time constructions.

In [35]:
def val_demheh(time):
    '''Validates a demonstrative heh cx'''
    if time.tag == 'time.H':
        return True
    
def token_demheh(time):
    '''Tokenizes demonstrative heh cx'''
    this_time = time.times[0]
    heh = this_time - 1
    return f'{F.lex_utf8.v(heh)}.{F.lex_utf8.v(this_time)}'

demheh = CXdata(singles, val_demheh, token_demheh)

classes['demon_heh'] = demheh.timeset

prog()

220 matching constructions found...
	0.5 now accounted for


In [36]:
demheh.counts

Unnamed: 0,Total
ה.יום,194
ה.לילה,22
ה.שׁנה,3
ה.שׁביעי,1


Random inspection of mass results...

In [37]:
# random.shuffle(demheh.results['ה.יום'])

# A.show(demheh.results['ה.יום'])

Inspection of rare results...

In [38]:
A.show(demheh.results['ה.שׁנה'])

### NEXT

In [39]:
# collect every construction already assigned to a class
classified = set(c for cs in classes.values() for c in cs)
remaining = [L.d(cx, 'word') for cx in singles
             if cx not in classified
             and Time(cx).tag.startswith('time.')]

In [40]:
len(remaining)

42

In [41]:
A.show(remaining)