# Time Adverbial Surface Tokens

This study of time adverbials starts from the distinct surface forms found among all phrases marked for that function in the [BHSA](http://www.github.com/etcbc/bhsa). The BHSA dataset serves as the point of departure because it contains rich syntactic annotations for words, phrases, clauses, and sentences. Its annotations include argument functions for phrases, among them a tag for adverbial time indication, marked with the value `function=Time`. These phrases have been preprocessed in the [pipeline](../data/pipeline/readme.md) with corrections and enhancements that aid the analysis. The most important enhancement is a new object in the dataset, the `chunk`. The `chunk` corrects for cases where BHSA divides a single time adverbial phrase into separate, adjacent phrases; such divisions are anomalous, since in comparable situations BHSA keeps the phrase together. Like the phrase, the `chunk` does not capture every element involved in a time adverbial construction, which can be composed of units at several levels of the hierarchy, such as a phrase with a subordinated clause. It does, however, provide a firm starting point from which to learn the general tendencies of time adverbials.

A surface form token is the most basic means of identifying recurring tendencies in adverbial time expression. A token is defined as a simple sequence of words that begins and ends at the boundaries of a chunk with `label=timephrase`. An example can be seen in the first phrase of Genesis 1:1:

> בְּרֵאשִׁית

This token consists of the two-word sequence ב + ראשׁית. Stripping the string of vowel points and accents makes it possible to match this token with other instances:

> בראשׁית &nbsp;&nbsp;&nbsp;&nbsp;(Gen 1:1) <br>
> בראשׁית &nbsp;&nbsp;&nbsp;&nbsp;(Jer 26:1) <br>
> בראשׁית &nbsp;&nbsp;&nbsp;&nbsp;(Jer 27:1) <br>
> בראשׁית &nbsp;&nbsp;&nbsp;&nbsp;(Jer 28:1) <br>
> בראשׁית &nbsp;&nbsp;&nbsp;&nbsp;(Jer 49:34) <br>

By matching these surface tokens, we can construct a primitive `timephrase` type. The בראשׁית type thus has a frequency of 5. Note that Hos 9:10 is not matched, since its token, בראשׁיתה, contains a suffix and is therefore different. Yet this instance is clearly related to the other 5 cases. This notebook does not yet explore the comparison and linking of similar elements; it focuses instead on the primary tendencies seen among the dominant token strings.
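
The matching step can be illustrated with a minimal sketch. It is not the pipeline's own tokenizer: the helper `strip_pointing` and the toy `instances` mapping are hypothetical, and the sketch assumes that removing Unicode vowel points and accents (while keeping the shin and sin dots) is enough to make the forms comparable.

```python
import re
from collections import Counter

# Hebrew accents (U+0591-U+05AF), vowel points with dagesh and meteg
# (U+05B0-U+05BD), rafe (U+05BF), and qamats qatan (U+05C7).
# The shin and sin dots (U+05C1, U+05C2) are deliberately kept.
POINTING = re.compile('[\u0591-\u05AF\u05B0-\u05BD\u05BF\u05C7]')

def strip_pointing(text):
    """Hypothetical helper: remove vowel points and accents from Hebrew text."""
    return POINTING.sub('', text)

# The pointed form from Gen 1:1, spelled out with escapes so that the
# order of the combining marks is unambiguous.
pointed = '\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05C1\u05B4\u05D9\u05EA'
bare = strip_pointing(pointed)

# Toy data: the five matching instances plus the suffixed form of Hos 9:10.
instances = {
    'Gen 1:1': bare,
    'Jer 26:1': bare,
    'Jer 27:1': bare,
    'Jer 28:1': bare,
    'Jer 49:34': bare,
    'Hos 9:10': bare + '\u05D4',  # extra final he, so a different token
}

type_counts = Counter(instances.values())
print(type_counts[bare])  # frequency of the unsuffixed type
```

Because the comparison is an exact string match, the suffixed form in Hos 9:10 ends up as its own type with a count of 1.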

By counting token string frequencies, we are able to sort from most to least frequent. This in turn allows us to identify the most recurrent sequences in the dataset.

## Load Modules and Data

In [47]:
# Text-Fabric processor and tools
from tf.fabric import Fabric
from tf.app import use
from tools.locations import data_locations
from tools.time import Time

# stats & data-containers
import collections, random, csv, re
import pandas as pd
import numpy as np
import scipy.stats as stats
from tools.significance import contingency_table, apply_fishers
from tools.pca import plot_PCA
from tools.helpers import convert2pandas
from sklearn.decomposition import PCA

# data visualizations
from tools.visualize import reverse_hb, barplot_counts
import seaborn as sns
sns.set(font_scale=1.5, style='whitegrid')
import matplotlib.pyplot as plt
from IPython.display import display, HTML

# load custom BHSA data + heads
TF = Fabric(locations=data_locations.values())
load_features = ['g_cons_utf8', 'trailer_utf8', 'label', 'lex',
                 'role', 'rela', 'typ', 'function', 'language',
                 'pdp', 'gloss', 'vs', 'vt', 'nhead', 'head', 
                 'mother', 'nu', 'prs', 'sem_set', 'ls', 'st',
                 'kind']
api = TF.load(' '.join(load_features))
F, E, T, L = api.F, api.E, api.T, api.L # shortform TF methods

# configure Hebrew displaying
A = use('bhsa', api=api)
A.displaySetup(condenseType='clause')

# import TF-dependent tools
from tools.tokenize import tokenize_surface

This is Text-Fabric 7.7.3
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

123 features found and 4 ignored
  0.00s loading features ...
   |     0.00s Not enough info for structure in otext, structure functionality will not work
   |     0.19s B g_cons_utf8          from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.12s B lex                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.09s B trailer_utf8         from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.02s B label                from /Users/cody/github/csl/time_collocations/data/tf
   |     0.02s B role                 from /Users/cody/github/csl/time_collocations/data/tf
   |     0.28s B rela                 from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.19s B typ                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.06s B function             from /Users/cody/github/csl/time_collocations/data/tf
   |     0.14s B language        

# Basic Exploration

The analysis looks at chunk objects with `label=timephrase`. Below we print the total number of such objects.

In [2]:
times = A.search('''

chunk label=timephrase
/with/
    word language=Hebrew
/-/

''')

  0.58s 3881 results


## Surface Form Analysis

This analysis aims to identify common tendencies among the various time adverbial instances. The constructional principle of non-synonymy states that two constructions with distinct surface forms cannot be identical in meaning (Goldberg 1995). Even small differences in surface form indicate variations in semantic or pragmatic meaning. The primary constructional patterns of time adverbials can be seen by applying a tokenization clustering strategy.

Tokenization clustering is a data-oriented way to create the sets of identical forms. The strategy generates a surface token of each `timephrase chunk` by stripping the surface text of vowels, accents, and spacing. The tokenizer adds consonantal ה in cases where an article is vocalized. This process allows similar forms to cluster despite minor differences. The counts are produced and visualized below.
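
As a rough illustration of the idea, and not the actual `tokenize_surface` implementation, the sketch below builds a dot-separated token from a list of words. The input format, a list of `(consonants, is_article)` pairs, and the function name `surface_token_sketch` are assumptions made for this example; the sketch further assumes that an article which is only vocalized arrives with an empty consonantal string.

```python
def surface_token_sketch(words):
    """Join consonantal word forms with dots, restoring vocalized articles.

    `words` is a list of (consonants, is_article) pairs. This input
    format is hypothetical and chosen only for illustration.
    """
    parts = []
    for consonants, is_article in words:
        # An article that is only vocalized has no consonantal text of
        # its own, so a consonantal he is supplied in its place.
        if is_article and not consonants:
            consonants = 'ה'
        parts.append(consonants)
    return '.'.join(parts)

# Toy example: preposition, vocalized article, noun, written article,
# demonstrative.
words = [
    ('ב', False),
    ('', True),
    ('יום', False),
    ('ה', True),
    ('הוא', False),
]

token = surface_token_sketch(words)
print(token)
```

Restoring the article as an explicit element means that forms with a written article and forms with a merely vocalized one fall into the same cluster.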

In [56]:
surfaces = collections.Counter()

for cx in times:
    cx = cx[0]
    surface_token = tokenize_surface(cx)
    surfaces[surface_token] += 1
    
surfaces = convert2pandas(surfaces)

In [57]:
print(f'{surfaces.shape[0]} unique surface forms found')

1167 unique surface forms found


In [58]:
surfaces.head(50)

| Surface token | Total |
|---|---|
| עתה | 342 |
| ב.ה.יום.ה.הוא | 203 |
| ה.יום | 191 |
| ל.עולם | 85 |
| ב.ה.בקר | 78 |
| עד.ה.יום.ה.זה | 71 |
| ב.יום | 69 |
| אז | 66 |
| שׁבעת.ימים | 63 |
| עד.עולם | 53 |


This top list accounts for a substantial proportion of all known time adverbials in the dataset:

In [63]:
surfaces['Total'].head(50).sum() / surfaces['Total'].sum()

0.545735635145581

The top 50 of 1,167 forms thus account for roughly 55% of all time adverbial instances, which suggests that this surface count table contains most of the key constructional elements for a TIME taxonomy.
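
The proportion can be checked by hand from the figures already shown. The sketch below is plain arithmetic on those figures. It assumes that every chunk contributes exactly one token, so that the grand total of the counts equals the 3881 search results reported earlier.

```python
# Counts of the ten most frequent surface forms, copied from the table above.
top10_counts = [342, 203, 191, 85, 78, 71, 69, 66, 63, 53]

# Assumption: every chunk yields exactly one token, so the grand total of
# token counts equals the number of search results.
total_chunks = 3881

# Share reported above for the 50 most frequent forms.
top50_share = 0.545735635145581

top10_share = sum(top10_counts) / total_chunks
top50_instances = round(top50_share * total_chunks)

print(f'top 10 forms: {sum(top10_counts)} instances ({top10_share:.1%})')
print(f'top 50 forms: {top50_instances} instances ({top50_share:.1%})')
```

Under that assumption, the ten forms displayed above alone cover 1221 instances, a little under a third of the dataset, and the top 50 cover 2118.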

A number of key tendencies can be seen among the surface forms. First, time phrases divide into those with a preposition and those without. They also contain further specifications such as demonstratives, quantifiers (both quantitative and qualitative), plurals, and construct elements. Some forms, chiefly adverbial forms such as עולם, עתה, אז, and others, occur with no apparent specifications. In an [earlier study](exploratory/time_constructions.ipynb), these specifications were shown to be significant. The primary procedure developed in that analysis is now contained in the `tools.time` module, mainly in a class `Time`, which has attributes related to the specifiers found inside a given time construction. Those specifiers can now be analyzed.

# Tagging Time Specifications

In [53]:
Time(1446840).tag

'PPtime.H.pl.quant.card'
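
A tag string like this one can be broken into its parts for inspection. The sketch below assumes that the tag is a dot-separated list whose first element names the base construction and whose remaining elements are specifier labels; that reading is an assumption, since the tag format is not documented in this notebook.

```python
# Tag string copied from the output above.
tag = 'PPtime.H.pl.quant.card'

# Assumption: components are separated by dots, with the base
# construction first and the specifier labels after it.
components = tag.split('.')
base, specifiers = components[0], components[1:]

print(base)
print(specifiers)
```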