# Time Constructions, Part 2

In part 1 of the time construction analysis, I used exploratory analysis to find the major forms among time constructions in Biblical Hebrew. That analysis culminated in a tokenization strategy which labeled like elements within time constructions to obtain raw groups. Those groups were counted, and it was found that out of 312 different raw surface forms, the top 11 accounted for 75% of individual instances. These few forms thus encapsulate a majority of the data.

In this notebook, I want to break down the major subcategories of the top surface forms. For instance, the most common token is `prep.time`, i.e. a preposition + a time word. But there are major differences amongst this group. Specifically, the time noun is often specified by a further element. In some cases this further element consists of an infinitival clause that modifies the time noun. Some time nouns are statistically associated with the time function, such as יום, שׁנה etc. But some are not, such as nouns which describe events. These are of a different semantic type.

This analysis will follow a strategy similar to that of part 1, using a process of elimination to narrow down the primary groups in the data.

In [None]:
import collections, csv, random, os
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(font_scale=1.5, style='whitegrid')
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from skfuzzy.cluster import cmeans
from tf.fabric import Fabric
from tf.app import use

# import custom tools
os.sys.path.append('../')
from tools.locations import data_locations 
from tools.significance import contingency_table, apply_fishers
from tools.pca import plot_PCA
from tools.helpers import convert2pandas
from tools.visualize import reverse_hb
from tools.tokenize import tokenize_surface
from tools.time import Time

TF = Fabric(locations=data_locations.values())
api = TF.load('''

vs vt pdp gloss lex language ps gn
rela typ number function prs
g_cons_utf8 lex_utf8 nu mother st uvf
g_word_utf8 trailer_utf8 voc_lex_utf8
head nhead obj_prep sem_set
ls top_assoc funct_assoc kind txt
label role
''')

A = use('bhsa', api=api, hoist=globals(), silent=True)

A.displaySetup(condenseType='clause', condensed=True, withNodes=True)

# Helper Functions

In [None]:
firstyear = '../../data/paper_data/firstyear2' # directory for first year review paper saves

def countBarplot(count_df, 
                 title='', 
                 column='Total', 
                 reverse_labels=False, 
                 size=(8, 6),
                 xlab_rotation=None,
                 ylim=None,
                 save=None,
                 xlabel=None,
                ):
    '''
    Makes a simple barplot from a DataFrame of counts
    (e.g. one built from a collections.Counter object).
    '''
    n_bars = list(range(0, count_df.shape[0]))
    x_labels = [''.join(reversed(prep)) for prep in count_df.index] if reverse_labels else count_df.index
    plt.figure(figsize=size)
    sns.barplot(x=n_bars, y=count_df[column], color='darkblue') # keyword x/y; newer seaborn versions no longer accept them positionally
    plt.xticks(n_bars, x_labels, size=18, rotation=xlab_rotation)
    plt.yticks(size=18)
    if ylim:
        plt.ylim(top=ylim[0], bottom=ylim[1])
    if xlabel:
        plt.xlabel(xlabel,size=18)
    plt.ylabel(column, size=18)    
    if save:
        plt.savefig(save, dpi=300, bbox_inches='tight')
    plt.title(title, size=18,  y=1.05)
    plt.show()

# Helper Data

In [None]:
# label2result = collections.defaultdict(list)
# for cx in F.otype.s('chunk'):
#     label2result[F.label.v(cx)].append(L.d(cx, 'phrase'))

# Time Constructions: Their Distribution and Make-Up

This analysis repeats some parts of previous studies, [SBH_time_expressions](https://nbviewer.jupyter.org/github/CambridgeSemiticsLab/BH_time_collocations/blob/master/analysis/SBH_time_expressions.ipynb) and [duratives](https://nbviewer.jupyter.org/github/CambridgeSemiticsLab/BH_time_collocations/blob/master/analysis/duratives.ipynb), but now with the new time construction data and for all known time constructions in the Hebrew Bible.

### Basic BHSA Stats

#### number of phrases in BHSA

In [None]:
phrases = A.search('''

phrase
/with/
    word language=Hebrew
/-/

''')

#### number of functions in BHSA

In [None]:
len(F.function.freqList())

In [None]:
F.function.freqList()

Hebrew time phrases (raw, before post-processing).

In [None]:
tp = A.search('''

phrase function=Time
/with/
    word language=Hebrew
/-/
    
''')

After the post-processing...

In [None]:
time_cx = A.search('''

construction
    
''')

## Time Construction Distribution and Selectivity

### Distribution of Time Constructions Across Corpus

In [None]:
strip_data = []
covered_chapters = set()
bookboundaries = {}

twelve = ('Hosea', 'Joel', 'Amos', 'Obadiah',
          'Jonah', 'Micah', 'Nahum', 'Habakkuk',
          'Zephaniah', 'Haggai', 'Zechariah',
          'Malachi')

# map grouped book names
megilloth = ('Ruth', 'Lamentations', 'Ecclesiastes', 'Esther', 'Song_of_songs')
book_map = {'1_Kings': 'Kings', '2_Kings':'Kings', '1_Samuel':'Samuel',
            '2_Samuel':'Samuel', '1_Chronicles':'Chronicles', '2_Chronicles':'Chronicles',}
for book in twelve: book_map[book] = 'Twelve'
for book in megilloth: book_map[book] = 'Megilloth'
for book in ('Ezra', 'Nehemiah', 'Daniel'): book_map[book] = 'Daniel-Neh'

    
# iterate through constructions and gather book data
this_book = None

for cx in F.otype.s('construction'):
    chapter_node = L.u(cx, 'chapter')[0]
    book, chapter, verse = T.sectionFromNode(cx)
    this_book = book_map.get(book, book)
    covered_chapters.add(chapter_node)
    chapter_label = len(covered_chapters)
    
    if this_book not in bookboundaries: # add first chapter to boundaries for plotting
        bookboundaries[this_book] = chapter_label
    
    strip_data.append(chapter_label)

In [None]:
strip_title = 'Distribution of Time Function Constructions by Chapter (smaller books are grouped together)'
plt.figure(figsize=(20, 6))
sns.stripplot(x=strip_data, jitter=0.3, color='darkblue')
plt.xticks(ticks=list(bookboundaries.values()), labels=list(bookboundaries.keys()), rotation='vertical', size=20)
plt.savefig('paper_data/firstyear/chapter_distribution.png', dpi=300, bbox_inches='tight')
print(strip_title) # keep title out of savefig
plt.show()

### Degree of Dispersion Compared to Other Functions

The strip chart gives a good sense of how spread out time constructions are in the Hebrew Bible. We can also see that the distribution is sparser throughout the poetic books, from Isaiah until the beginning of Daniel-Nehemiah. This variation in density can be quantified using a statistical measure known as **degree of dispersion** (Gries, S. 2008. "Dispersions and Adjusted Frequencies in Corpora"). We can use this measure to compare the time construction against other functions in the corpus.
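
As a rough sketch of how the measure works, the snippet below computes DP for a single item in plain Python: for each part of the corpus it takes the absolute difference between that part's share of the corpus and its share of the item's occurrences, sums the differences, and halves the result. The function name `degree_of_dispersion` and the three toy "books" are assumptions made for illustration; they are not corpus figures.

```python
def degree_of_dispersion(counts_by_part, part_sizes):
    """Return the DP score for one item (0 = perfectly even, near 1 = clumped).

    counts_by_part: frequency of the item in each corpus part (e.g. book)
    part_sizes: total size of each corpus part, keyed the same way
    """
    total_size = sum(part_sizes.values())
    total_count = sum(counts_by_part.values())
    diffs = 0.0
    for part, size in part_sizes.items():
        expected = size / total_size                           # part's share of the corpus
        observed = counts_by_part.get(part, 0) / total_count   # part's share of the item
        diffs += abs(expected - observed)
    return diffs / 2

# hypothetical toy corpus of three "books" (total phrase counts)
sizes = {'A': 500, 'B': 300, 'C': 200}
even = {'A': 50, 'B': 30, 'C': 20}      # spread in proportion to book size
clumped = {'A': 0, 'B': 0, 'C': 100}    # everything in the smallest book

dp_even = degree_of_dispersion(even, sizes)
dp_clumped = degree_of_dispersion(clumped, sizes)

# reported as 1 - DP, so that a bigger number means a more even spread
print(1 - dp_even, 1 - dp_clumped)
```

The evenly spread item scores a DP of 0 and the clumped item a DP of 0.8. Reporting `1 - DP` turns this around so that higher values read as "more evenly distributed".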

In [None]:
# count all phrase tokens per book
phrase_functions = collections.defaultdict(lambda:collections.Counter())

functionmap = {'PreO': 'Pred', 'PreS': 'Pred', 'PtcO': 'Pred', # collect some of the idiosyncratic BHSA functions
              'IntS': 'Intj', 'NCoS': 'NCop','ModS': 'Modi',
              'ExsS': 'Exst'}

for phrase in F.otype.s('phrase'):
    book, chapter, verse = T.sectionFromNode(phrase)
    book = book_map.get(book, book)
    # use constructional phrases only for Time function phrases
    # some time phrases are excluded, others follow a primary time phrase
    # ignore excluded TPs and secondary TPs
    if F.function.v(phrase) == 'Time':
        time_cx = L.u(phrase, 'construction')[0] if L.u(phrase, 'construction') else tuple()
        if not time_cx: # excluded TP
            continue
        elif list(L.d(time_cx, 'phrase')).index(phrase) == 0:
            phrase_functions[book]['Time'] += 1

    # count all other function types
    else:
        funct = functionmap.get(F.function.v(phrase), F.function.v(phrase))
        function = funct2function[funct].title()
        phrase_functions[book][function] += 1
    
    
phrase_functions = pd.DataFrame(phrase_functions).fillna(0)

The BHSA has some idiosyncratic functions that only occur a handful of times relative to the whole corpus. See especially those below that fall below a frequency of 300:

In [None]:
phrase_functions.sum(1)

I've decided to remove these marginal forms from the analysis by selecting only those that occur more than 300 times in total. The remaining functions are seen below.

In [None]:
phrase_functions = phrase_functions[phrase_functions.sum(1) > 200]

phrase_functions.sum(1).sort_values(ascending=False)

In [None]:
expected_prop = phrase_functions.sum() / phrase_functions.sum().sum()
observed_prop = phrase_functions.div(phrase_functions.sum(1), axis=0)
prop_diffs = abs(expected_prop-observed_prop)
dp = prop_diffs.sum(1) / 2
dp = 1-pd.DataFrame(dp, columns=['Degree of Dispersion']).sort_values(by='Degree of Dispersion') # DP score finalized here, NB 1- to make it more intuitive (Bigger==more distributed)

In [None]:
expected_prop.head()

In [None]:
dp

In [None]:
dp.loc['Time'] - dp.loc['Adjunct']

In [None]:
title = 'Degree of Dispersion by Phrase Function per Book in Hebrew Bible (higher is more evenly distributed)'
save = 'paper_data/firstyear/phrase_DP.png'
countBarplot(dp, title=title, column='Degree of Dispersion', size=(15, 6), xlab_rotation='vertical', ylim=(1, 0.60), save=save)

It is significant here that the time construction is more consistently spread than the regular Adjunct function. Its spread relative to Location is harder to evaluate because some phrases that function as locations reside in the Complement function. The BHSA labels many locative phrases as simple complements to movement verbs without further marking them as locative in nature. That is a shortcoming of the data. This data does tell us, however, that the Time function is more evenly spread than the generic Adjunct function, and certainly it is more evenly distributed than Vocative or Question phrases.

We observed in the stripplot that the Time function appeared to be less attested in the books ranging from Isaiah through the end of the Megilloth. 

**Presented below is the difference, per book, between the observed proportion of Time phrases and the expected proportion**. They are sorted from greatest to least, with a negative value indicating that the Time function is under-represented in relation to the size of the book.

In [None]:
prop_diffs_book = pd.DataFrame((observed_prop-expected_prop).loc['Time'].sort_values(ascending=False))
prop_diffs_book.columns = ['difference']

prop_diffs_book

In [None]:
title = 'Difference in Expected / Observed Proportions Per Book for Time, a lower value means less were observed than expected'
countBarplot(prop_diffs_book, column='difference', title=title, size=(15, 6), xlab_rotation='vertical', save='paper_data/firstyear/dp_book_diff.png')

As expected from the strip chart, poetic books like Ezekiel, Job, Jeremiah, Proverbs, and Isaiah contain lower-than-expected frequencies of the Time function. The inclusion of Genesis in this group is surprising, although the barplot helps to see that the difference from Isaiah to Ezekiel is proportionately large. Likewise surprising is the difference in spread between Kings and Chronicles.

I am a bit curious how these differences compare with other kinds of functions. Let's look at the `Pred` function, the function that is said to be the most distributed.

In [None]:
prop_diffs_book_PRED = pd.DataFrame((observed_prop-expected_prop).loc['Predicate'].sort_values(ascending=False))
prop_diffs_book_PRED.columns = ['difference']
title = 'Difference in Expected / Observed Proportions Per Book for Pred, a lower value means less were observed than expected'
countBarplot(prop_diffs_book_PRED, column='difference', title=title, size=(15, 6), xlab_rotation='vertical')

It is very interesting that Chronicles and Daniel-Nehemiah are less verbal than expected, while Psalms is more so! To put it another way, relative to all other books in the corpus, Chronicles and Daniel-Nehemiah have a lower distribution of predicate phrases relative to the total number of phrases they contain.

### Excursus: Why is Pred so underrepresented in Chronicles?

To answer this question, let's find the function which is most OVER-represented in the book...

In [None]:
title = 'Difference in Expected / Observed Function Proportions in Chronicles, a lower value means less were observed than expected'
diff_all = observed_prop - expected_prop
chronicles_diffs = pd.DataFrame(diff_all['Chronicles'].sort_values(ascending=False))
chronicles_diffs.columns = ['difference']
countBarplot(chronicles_diffs, title=title, column='difference', size=(15, 6), xlab_rotation='vertical')

The PreC seems like a possible candidate explanation. To find out for sure we could do a count of nominal clauses across all books and see whether Chronicles has a higher than expected proportion. But for now we will be satisfied with this.

### Examine Variety within the Head Lexemes of Phrases with Various Functions

I want to know how the Time phrase compares with other phrase functions in terms of the diversity of its head lexemes. In other words, does the Time function have a wide variety of terms that it regularly uses, or is it more highly selective of key terms? If the latter is true, it could show that time nouns are specialized in their use. Note that for this test, I do not look at lexical heads, but semantic heads. So, for instance, for a prepositional phrase I do not take the preposition but rather the object of the preposition. 

After making a count of all head lexeme/function co-occurrences, I will normalize the number of lexemes per 100 uses of each function. The normalization is adapted from the helpful explanation of the [grammar lab](http://www.thegrammarlab.com/?p=160). I've adapted it by replacing "word counts" with "lexeme counts" and "corpus size" with "frequency of function." The frequency of each function is calculated by simply taking the sum of its co-occurrence counts.
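
The normalization itself is simple arithmetic, sketched here on made-up counts: the number of distinct head lexemes attested with a function is multiplied by 100 and divided by the total number of uses of that function. The helper name `unique_per_100` and the lexeme counts are hypothetical and serve only to illustrate the calculation.

```python
from collections import Counter

# hypothetical head-lexeme counts per function (not corpus figures)
toy_function_heads = {
    'Time': Counter({'day': 60, 'year': 30, 'night': 10}),
    'Subject': Counter({'king': 5, 'man': 5, 'people': 4, 'priest': 3, 'city': 3}),
}

def unique_per_100(head_counts):
    """Number of unique head lexemes per 100 uses of a function."""
    n_unique = len(head_counts)          # distinct lexemes attested
    n_uses = sum(head_counts.values())   # total uses of the function
    return n_unique * 100 / n_uses

normalized = {funct: unique_per_100(counts)
              for funct, counts in toy_function_heads.items()}
print(normalized)
```

In this toy data the Time function uses 3 lexemes per 100 uses while Subject uses 25, so a lower score indicates a function that is more selective about its heads.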

In [None]:
# make a co-occurrence matrix of function columns and co-occurring head lexeme rows

function_heads = collections.defaultdict(lambda: collections.Counter())

for ph in F.otype.s('phrase'):
    
    if not E.head.t(ph): # it should have a head
        continue
    
    funct = functionmap.get(F.function.v(ph), F.function.v(ph))
    function = funct2function[funct].title()
    
    if function in {'Exst', 'EPPr'}:
        continue
    
    for head in E.nhead.t(ph):
        function_heads[function][F.lex.v(head)] += 1
        
function_heads = pd.DataFrame(function_heads).fillna(0)

Make the normalizations...

In [None]:
function_to_lexs = dict((funct, (function_heads[function_heads[funct] > 0]).shape[0]) for funct in function_heads)
function_to_lexs = pd.DataFrame.from_dict(function_to_lexs, orient='index')
function_to_lexs = function_to_lexs[(function_to_lexs > 4).all(1)]

In [None]:
norm_fs_lex = function_to_lexs*100
norm_fs_lex = norm_fs_lex.div(function_heads.sum(), axis='rows')
norm_fs_lex = norm_fs_lex.sort_values(by=0).dropna()

In [None]:
plt.figure(figsize=(13, 7))
sns.barplot(data=norm_fs_lex.transpose(), color='darkblue')
plt.xticks(rotation='vertical')
plt.ylabel('Unique Head Lexemes')
#plt.xlabel('Phrase Functions')
plt.savefig('paper_data/firstyear/unique_heads.png', dpi=300, bbox_inches='tight')
#plt.annotate('Time is very selective', xy=(10, 3), xytext=(10, 10), arrowprops=dict(facecolor='red', shrink=0.05), size=18)
plt.title('Number of Unique Head Lexemes per Function per 100 Uses')
plt.show()
display(norm_fs_lex)

## The Make-Up of Time Constructions

Beginning with their phrase types, I will analyze the kind of time constructions found in the corpus.

### Phrase Types Reflected in Constructions

`PP` is prepositional phrase, `NP` is noun phrase, `AdvP` is adverb phrase, as might be expected.

In [None]:
cx_types = collections.Counter()

for cx in F.otype.s('construction'):
    firstphrase = L.d(cx, 'phrase')[0]
    cx_types[F.typ.v(firstphrase)] += 1
    
cx_types = convert2pandas(cx_types)

cx_types

In [None]:
cx_types.to_excel(firstyear+'phrase_types.xlsx')

In [None]:
countBarplot(cx_types, title='Phrase Types in Time Construction Set', xlabel='Phrase Types', save='paper_data/firstyear/phrase_types.png')

Proportion of prepositional phrases...

In [None]:
cx_types.loc['PP']['Total'] / cx_types.sum()[0]

Thus, 67%.

There is a difference of 156% between the counts of NP and those of PP:

In [None]:
(cx_types.loc['PP']['Total'] - cx_types.loc['NP']['Total']) / cx_types.loc['NP']['Total']

The preposition is the most influential form within time constructions.
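
The two figures above answer different questions, which a small worked example may make clearer: the share is the PP count divided by all constructions, while the relative difference expresses the PP surplus as a fraction of the NP count. The counts below are hypothetical round numbers chosen for easy arithmetic, not the corpus totals.

```python
# hypothetical counts per phrase type (illustration only)
toy_counts = {'PP': 300, 'NP': 120, 'AdvP': 80}

# share of all constructions that are prepositional phrases
pp_share = toy_counts['PP'] / sum(toy_counts.values())

# relative difference: how much larger the PP count is than the NP count
relative_diff = (toy_counts['PP'] - toy_counts['NP']) / toy_counts['NP']

print(f'PP share: {pp_share:.0%}, PP vs. NP: +{relative_diff:.0%}')
```

With these toy numbers PPs make up 60% of the set and outnumber NPs by 150%, i.e. there are two and a half times as many PPs as NPs.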

### Compare with unprocessed Time Phrases

In [None]:
tp_types = collections.Counter()

for ph in tp:
    tp_types[F.typ.v(ph[0])] += 1
    
tp_types = convert2pandas(tp_types)

display(tp_types)

countBarplot(tp_types)

In [None]:
tp_types.loc['AdvP'] - cx_types.loc['AdvP']

### Compare with Location

This includes `Loca` phrases as well as complement phrases with a semantic head that has high association with location phrases.

In [None]:
locations = A.search('''

phrase function=Cmpl
/with/
    <nhead- word LocaAssoc>2
/-/

''', shallow=True) | A.search('''

phrase function=Loca


''', shallow=True)

print(len(locations), 'total locations found...')



In [None]:
loca_types = collections.Counter()

for ph in locations:
    loca_types[F.typ.v(ph)] += 1
    
loca_types = convert2pandas(loca_types)

loca_types.index = ['PP', 'AdvP', 'NP', 'PrNP\n(proper noun phrase)']

display(loca_types)

countBarplot(loca_types, save=firstyear+'loca_types.png', xlabel='Phrase Types')

In [None]:
loca_types.to_excel(firstyear+'loca_types.xlsx')

Compare percentage of prepositions...

In [None]:
loca_types.loc['PP'][0]  / loca_types.sum()[0]

#### Cases with a proper noun in Time Phrases

In [None]:
# A.show(A.search('''

# phrase function=Time
# /with/
#     <nhead- word st=c lex#JWM/|MWT/
#     <: word language=Hebrew pdp=nmpr
# /-/

# '''), extraFeatures='st')

### See if Differences Between Loca and Time are Statistically Significant

In [None]:
loca_types

In [None]:
cx_types

In [None]:
time_vs_loca = pd.concat([cx_types, loca_types], axis=1, sort=False).fillna(0)
time_vs_loca.columns = ['Time', 'Loca']

time_vs_loca

Apply Fisher's test for significance...

In [None]:
time_vs_loca_fish = apply_fishers(time_vs_loca)

time_vs_loca_fish

### Preposition & Time Associations

I want to see whether certain prepositions are particularly associated with certain time nouns. A version of this analysis was done in [SBH_time_expressions](https://nbviewer.jupyter.org/github/CambridgeSemiticsLab/BH_time_collocations/blob/master/analysis/SBH_time_expressions.ipynb) for Genesis-Kings. Here we do the analysis for the entire Hebrew Bible.

The association measure is Fisher's exact test.
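
As a rough illustration of what such an association score involves, the sketch below computes a two-sided Fisher's exact test for a single 2x2 table in plain Python and converts the p-value into a signed -log10 score. It is not the code of `tools.significance.apply_fishers`, whose internals are not shown in this notebook. The function names and the toy counts are assumptions made for illustration only.

```python
from math import comb, log10

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # hypergeometric probability of seeing x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)
    # sum the probabilities of every table no more likely than the observed one
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)

def signed_score(a, b, c, d):
    """-log10(p); negative when the pair occurs less often than expected."""
    p = fisher_exact_two_sided(a, b, c, d)
    expected = (a + b) * (a + c) / (a + b + c + d)
    strength = -log10(p)
    return strength if a >= expected else -strength

# hypothetical table: a = prep with noun, b = prep with other nouns,
# c = other preps with noun, d = other preps with other nouns
p_value = fisher_exact_two_sided(8, 2, 1, 5)
attracted = signed_score(8, 2, 1, 5)
repelled = signed_score(1, 5, 8, 2)
print(round(p_value, 4), round(attracted, 2), round(repelled, 2))
```

On this scale a p-value of 0.05 corresponds to a score of about 1.3, so scores above 1.3 or below -1.3 can be read as significant attraction or repulsion at that level.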

In [None]:
prep_obj_counts = collections.defaultdict(lambda: collections.Counter())
prep2obj2res = collections.defaultdict(lambda: collections.defaultdict(list))
allpreps = collections.Counter()

for cx in F.otype.s('construction'):
    
    ph = L.d(cx, 'phrase')[0] # get first phrase
    
    if F.typ.v(ph) != 'PP':
        continue
            
    prep_chunk = next(obj for obj in L.d(cx, 'chunk') if F.label.v(obj) == 'prep') # get prep chunk
    prep_obj = E.obj_prep.t(L.d(prep_chunk, 'word')[-1])
    prep_text = '.'.join(F.lex_utf8.v(w) for w in L.d(prep_chunk, 'word'))
    allpreps[prep_text] += 1
    
    if prep_obj:
        obj_text = F.lex_utf8.v(prep_obj[0])
        prep_obj_counts[prep_text][obj_text] += 1
        prep2obj2res[prep_text][obj_text].append(L.d(cx, 'phrase'))
        
prep_obj_counts = pd.DataFrame(prep_obj_counts).fillna(0)
allpreps = convert2pandas(allpreps)

### Show Preposition Counts

In [None]:
allpreps.to_excel(firstyear+'prep_counts.xlsx')

Count בְּ's share...

In [None]:
allpreps.loc['ב'].sum() / allpreps.sum()[0]

In [None]:
search = A.search('''

phrase function=Time typ=PP
    <nhead- word lex=>RK=/

''')

A.show(search)

In [None]:
formatPassages(search)

### Apply the association test below. This will take some time...

In [None]:
po_assoc = apply_fishers(prep_obj_counts)

#### Attraction Plots

In [None]:
def assign_hue(iterable_data, p=1.3, maxvalue=10, minvalue=-10):
    '''
    Function to assign heat-map hues based 
    on a p-value midpoint and max/min attraction
    values.
    
    The following rules are used for making
    the colors:
    p = pvalue, i.e. significance level
    upper grey = p
    lower grey = -p
    starting red = p+0.1
    starting blue = -p-0.4
    max_red = max(dataset) if > p = hotmax
    max_blue = min(dataset) if < p = coldmax
    
    --output--
    1. a dataframe with values mapped to a unique color code
    2. a list of rgba colors that are aligned with the
       indices of the data
    '''
    
    maxvalue = int(maxvalue) # for max red
    minvalue = int(minvalue) # for max blue
        
    # assign ranges based on p values and red/blue/grey
    red_range = len(range(int(p), maxvalue+1))
    blue_range = len(range(int(p), abs(minvalue-1)))
        
    blues = sns.light_palette('blue', blue_range)
    reds = sns.light_palette('red', red_range)
    grey = sns.color_palette('Greys')[0]
    
    # assign colors based on p-value
    data = list()
    colorCount = collections.Counter()
    rgbs = list()
    for point in iterable_data:
        if point > p:
            rgb = reds[int(point)-1]
            color = 'red'
        elif point < -p:
            rgb = blues[abs(int(point))-1] 
            color = 'blue'
        else:
            rgb = grey
            color = 'grey'
            
        color_count = colorCount.get(color, 0)
        colorCount[color] += 1
        data.append([point, f'{color}{color_count}'])
        rgbs.append(rgb)
        
    data = pd.DataFrame(data, columns=('value', 'color'))
        
    return data, rgbs

In [None]:
# values for uniform hue assignment:
maxattraction = float(po_assoc.max().max())
minattraction = float(po_assoc.min().min())
pvalue = 1.3

def plot_attraction(prep, size=(15, 5), save=''):
        
    # get plot data and generate hues
    colexs = po_assoc[prep].sort_values()    
    colex_data, colors = assign_hue(colexs.values, p=pvalue, maxvalue=maxattraction, minvalue=minattraction)
    
    # plot the figure
    plt.figure(figsize=size)
    dummyY = ['']*colexs.shape[0] # needed due to bug with Y & hue
    ax = sns.swarmplot(x=colex_data['value'], y=dummyY, hue=colex_data['color'], size=15, palette=colors)
    ax.legend_.remove()
        
     # offset annotation text from dot for readability
    offsetX, offsetY = np.array(ax.collections[0].get_offsets()).T
    
    plt.xlabel('log10 Fisher\'s Scores (attraction)')
    
    # annotate lexemes for those with significant values
    for i, colex in enumerate(colexs.index):  
        annotateX = offsetX[i]
        annotateY = offsetY[i] - 0.06
        colex_text = reverse_hb(colex).replace('/','').replace('=','')
        if abs(colexs[colex]) > pvalue: # annotate both attracted and repelled lexemes
            ax.annotate(colex_text, (annotateX, annotateY), size=20, fontname='Times New Roman')
            
    if save:
        plt.savefig(f'paper_data/firstyear/{prep}_assocs.png', dpi=300, bbox_inches='tight')
    
    plt.title(f'Time Attractions to {reverse_hb(prep)}')
    plt.show()

Let's look at everything up to כ by setting a count limit of > 20.

In [None]:
for prep in prep_obj_counts.columns[(prep_obj_counts.sum() > 20)]:
    
    top_attractions = pd.DataFrame(po_assoc[prep].sort_values(ascending=False))
    top_attractions.columns = ['Fisher\'s Score']
    top_attractions['Raw Counts'] = prep_obj_counts[prep].loc[top_attractions.index]
    top_attractions.round(2).to_excel(firstyear+f'{prep}_top_assocs.xlsx')
    
    plot_attraction(prep, size=(18, 5), save=True)
    display(top_attractions.head(10))

Look at ממחרת...

In [None]:
min_mxrt = A.search('''

verse
    clause
        phrase function=Time
            <head- word lex=MN
            <obj_prep- word lex=MXRT/

''')

'; '.join(['{} {}:{}'.format(*T.sectionFromNode(res[0])) for res in min_mxrt if F.txt.v(res[1]) in {'N', '?N'}])

In [None]:
'; '.join(['{} {}:{}'.format(*T.sectionFromNode(res[0])) for res in min_mxrt if F.txt.v(res[1]) not in {'N', '?N'}])

Compare with מתמול

In [None]:
min_tmwl = A.search('''

verse
    clause
        phrase function=Time
            <head- word lex=MN
            <obj_prep- word lex=TMWL/

''')

'; '.join(['{} {}:{}'.format(*T.sectionFromNode(res[0])) for res in min_tmwl])

In [None]:
T.text(min_tmwl[0][0])

In [None]:
A.show(A.search('''

phrase function=Time
    <head- word lex=MN
    <obj_prep- word lex=RXM/

'''))

I can see that times which are attracted to ב are primarily calendrical times like "day", "year", "month", "morning", but also עת "time". The attraction between יום and ב is quite strong.

The ל preposition, as well as עד, prefers more deictic, adverbial kinds of indicators like לעולם, לנצח, לפני, מחר. Indeed עד has nearly identical preferences. The association between ל and עולם is the strongest in the dataset:

In [None]:
print('top 5 association scores in dataset by their prep')
po_assoc.max().sort_values(ascending=False).head(5)

In [None]:
print('top 5 associations to ל')
po_assoc['ל'].sort_values(ascending=False).head(5)

This very strong score suggests the possibility that ל and עולם together constitute a strongly entrenched unit. Note also that the association between ל and נצח is likewise quite strong, as is the association with פנה. These smaller associations can be interpreted through the entrenched combination of ל+עולם.  

The preposition אחר has a distinct preference for nouns that are not necessarily associated with time, such as proper names and nouns representing events.

כ is attracted to עת, which is a notable similarity with ב. This is consistent with observations that these two prepositions have similar meanings. Its uses with תמול and אתמול are worth investigating.

Finally, מן is primarily attracted to מחרת.

### Time Constructions, Raw Forms (without accents)

In [None]:
letter_inventory = set(l for w in F.otype.s('word') for l in F.voc_lex_utf8.v(w))

raw_surfaces = collections.Counter()

for cx in F.otype.s('construction'):
    surface = ''
    for w in L.d(cx, 'word'):
        for let in F.g_word_utf8.v(w):
            if let in letter_inventory:
                surface += let
        if F.trailer_utf8.v(w) in letter_inventory:
            surface += F.trailer_utf8.v(w)
    raw_surfaces[surface] += 1
        
raw_surfaces = convert2pandas(raw_surfaces)

raw_surfaces.head(20)

In [None]:
raw_surfaces.head(20).to_excel(firstyear+'raw_surfaces.xlsx')

### Time Constructions, Clustered on Raw Surface Forms without Vocalization (tokens)

In this section, I break down time constructions by clustering them based on surface forms and various surface form filters. This is a rough form of clustering, by which two time constructions are grouped together if their tokenized strings match.
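
The grouping rule can be sketched with a `collections.Counter`: each construction is reduced to a dot-joined string of its words, and constructions with identical strings fall into the same cluster. The short list of constructions below is a hypothetical toy sample, not drawn from the Text-Fabric data.

```python
from collections import Counter

# hypothetical constructions, each given as a list of consonantal words
toy_constructions = [
    ['ב', 'ה', 'יום', 'ה', 'הוא'],
    ['ל', 'עולם'],
    ['ב', 'ה', 'יום', 'ה', 'הוא'],
    ['עד', 'עולם'],
    ['ל', 'עולם'],
    ['ב', 'ה', 'יום', 'ה', 'הוא'],
]

# two constructions share a cluster only if their token strings match exactly
clusters = Counter('.'.join(words) for words in toy_constructions)

for token, count in clusters.most_common():
    print(count, token)
```

Because the match is exact, ל.עולם and עד.עולם remain separate clusters even though they share a time word, which is why this clustering is only a rough first pass.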

In [None]:
def surfaceToken(phrasenode):
    '''
    Return a surface token of a phrase node.
    The words are dot-separated and heh consonants
    are added if they are present in vocalized form. 
    '''
    subtokens = []
    for w in L.d(phrasenode, 'word'):
        if F.lex.v(w) == 'H':
            subtokens.append('ה')
        else:
            subtokens.append(F.g_cons_utf8.v(w))
    return '.'.join(subtokens)
    

freq_surface = collections.Counter()
for cx in F.otype.s('construction'):
    freq_surface[surfaceToken(cx)] += 1
freq_surface = convert2pandas(freq_surface)

In [None]:
freq_surface.head(20)

In [None]:
freq_surface.to_excel(firstyear+'raw_tokens.xlsx')

In [None]:
freq_surface.head(50).sum()[0]

In [None]:
freq_surface.head(50).sum()[0] / len(list(F.otype.s('construction')))

ב.ה.יום.ה.הוא is a dominant pattern. But there are other patterns that are similar to it, such as עד.ה.יום.ה.זה or ב.ה.עת.ה.היא. Other similarities include ל.עולם and עד.עולם. Taking a broader definition of similarity to include a role within the phrase, we can see similarities between the preposition + object constructions such as: ל.עולם, ב.יום, ל.נצח.

In [None]:
cases = '''
ב.ה.יום.ה.הוא
עד.ה.יום.ה.זה
ב.ה.עת.ה.היא
ב.ה.ימים.ה.הם
ב.ה.עת.ה.הוא 
ב.ה.לילה.ה.הוא 
ב.עצם.ה.יום.ה.זה 
'''.split('\n')
demos = [c.strip() for c in cases if c]

freq_surface.loc[demos].sum()[0]

In [None]:
freq_surface.loc[demos].sum()[0] / len(list(F.otype.s('construction')))

In [None]:
defi = '''
ה.יום
ב.ה.בקר
עד.ה.ערב
ה.לילה
ב.ה.ערב
ב.ה.לילה'''.split('\n')

defis = [c.strip() for c in defi if c]

freq_surface.loc[defis].sum()[0]

In [None]:
freq_surface.loc[defis].sum()[0] / len(list(F.otype.s('construction')))

### Count Semantic Head Lexemes

In [None]:
sem_heads = collections.Counter()

for cx in F.otype.s('construction'):
    
    firstphrase = L.d(cx, 'phrase')[0]
    semhead = E.nhead.t(firstphrase)[0]
    
    sem_heads[F.voc_lex_utf8.v(semhead)] += 1
    
sem_heads = convert2pandas(sem_heads)

sem_heads.head(25)

In [None]:
sem_heads.head(50).to_excel(firstyear+'semantic_heads.xlsx')

Headed by מלכות

In [None]:
# A.show(A.search('''

# construction
#     =: phrase
#     /with/
#     <nhead- word lex=MLKWT/
#     /-/

# '''))

Headed by עוד

In [None]:
A.show(A.search('''

construction
    =: phrase
    /with/
    <nhead- word lex=<WD/
    /-/

'''))

### Time Constructions, Clustered on Parts of Speech and Chunks

Based on the kinds of resemblances mentioned above, I wanted to obtain a clustering that better reflected word types and sub-constructions within the time constructions. A "sub-construction", which I have called a "chunk", consists of either chained prepositional phrases, e.g. מקץ "from the end of...", or quantified noun phrases, which can consist of chained cardinal numbers such as שׁבעים ושׁשׁ שׁנה. These chunks were processed in [chunking](https://nbviewer.jupyter.org/github/CambridgeSemiticsLab/BH_time_collocations/blob/master/analysis/preprocessing/chunking.ipynb) and then further refined into complete tags in [time constructions [part 1]](https://nbviewer.jupyter.org/github/CambridgeSemiticsLab/BH_time_collocations/blob/master/analysis/time_constructions1.ipynb).

The result is a tokenization strategy which produces larger, more useful clusters. In fact, the top 11 of these clusters account for 76% of the entire dataset, as I show.
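
The coverage figure is a cumulative share: the counts of the n most frequent clusters divided by the count of all constructions. A minimal sketch on made-up label counts follows; the helper `top_n_coverage` and the labels other than `prep.time` are hypothetical.

```python
from collections import Counter

def top_n_coverage(counts, n):
    """Share of all instances covered by the n most frequent labels."""
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(n))
    return top / total

# hypothetical label counts (illustration only)
toy_labels = Counter({'prep.time': 50, 'tag_b': 20, 'tag_c': 15,
                      'tag_d': 10, 'tag_e': 5})

coverage_top2 = top_n_coverage(toy_labels, 2)
coverage_all = top_n_coverage(toy_labels, 5)
print(coverage_top2, coverage_all)
```

Here the top two labels cover 70% of the toy data. The same ratio underlies the percentages reported for the top 11 and top 20 clusters.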

In [None]:
freq_times = collections.Counter()
for cx in F.otype.s('construction'):
    freq_times[F.label.v(cx)] += 1
freq_times = convert2pandas(freq_times)

In [None]:
freq_times.head(20)

Top 11 account for 76% of the dataset:

In [None]:
freq_times.head(11)['Total'].sum() / freq_times['Total'].sum()

The top 20 account for 83% of the dataset:

In [None]:
freq_times.head(20)['Total'].sum() / freq_times['Total'].sum()

It is my hunch that the remaining data (roughly a quarter beyond the top 11 clusters, or 17% beyond the top 20) most often consists of some combination of the major types reflected in the top clusters. Thus by describing and understanding these major types, we can obtain even better clustering parameters.

From this point forward I will focus on accounting for the subgroups found amongst these major clusters.

## `prep.time`

What kind of time nouns most often appear in the `time` slot?

In [None]:
pt_time = collections.Counter()
pt_prep = collections.Counter()
pt_cx = collections.Counter()

tag2res = collections.defaultdict(list)

for cx in set(F.otype.s('construction')) & set(F.label.s('prep.time')):
    time = next(role[0] for role in E.role.t(cx) if role[1]=='time')
    prep = next(role[0] for role in E.role.t(cx) if role[1]=='prep')
    time_text = F.lex_utf8.v(time)
    prep_text = '.'.join(F.lex_utf8.v(w) for w in L.d(prep, 'word'))
    cx_text = '.'.join(F.g_cons_utf8.v(w) for w in L.d(cx, 'word'))
    
    pt_time[time_text] += 1
    pt_prep[prep_text] += 1
    pt_cx[cx_text] += 1
    tag2res[cx_text].append(L.d(cx, 'phrase'))
    tag2res[time_text].append(L.d(cx, 'phrase'))
    tag2res[prep_text].append(L.d(cx, 'phrase'))
    
pt_time = convert2pandas(pt_time)
pt_prep = convert2pandas(pt_prep)
pt_cx = convert2pandas(pt_cx)

#### Top Raw Surface Form Counts

In [None]:
pt_cx.head(20)

#### Top Times

In [None]:
pt_time.head(20)

#### Top Preposition Counts

In [None]:
pt_prep

Does עולם ever have additional modifications? I know from previous analysis of time constructions that they often have various morphological modifications or additional specifications. I would expect this to be different with עולם, and I would also expect this situation to resemble other words that are being used adverbially. If there is indeed a strict separation between patterns with and without these kinds of modifications, I may have good reason to define this as an "adverb construction," i.e. a construction with a deictic sense that carries its temporal modifications internally.

Practically, it makes more sense to first define what I mean by "modifications," especially in terms of database querying. To do that, I move on to the next most common item in the list, יום. I know from the previous analysis that יום *does* in fact attract these modifications. By defining them here, I might have a way to identify other cases that have such modifications. Then I can define those without modifications as the inverse of these search parameters.

**Below are a few examples of יום as used with a preposition.** The examples are shown in the context of a sentence, because infinitival modifiers of יום occur as clauses embedded in the same sentence. These cases in particular are marked with a clause relation of `RgRc` (Regens/rectum connection). Note that I have limited the display to a single case with `end=1`. Modify this to see the other examples.

In [None]:
A.show(tag2res['יום'], condenseType='sentence', extraFeatures='st vt', end=1) # <- NB raise end= to see more examples

In [None]:
random.shuffle(tag2res['יום'])

In [None]:
jwm = tag2res['יום']

for ph in jwm[:5]:
    print('{} {}:{}'.format(*T.sectionFromNode(ph[0])))
    print(T.text(L.u(ph[0], 'sentence')[0]))
    print()

In [None]:
A.search('''

sentence
    phrase function=Time
    <nhead- word lex=JWM/
    <: word lex=>CR

''')

In [None]:
T.sectionFromNode(1181030)

In [None]:
T.text(1181030)

In [None]:
T.text(L.u(1181030, 'verse')[0])

After reviewing several dozen cases, I see 4 specific patterns that follow the construction ב+יום:

* \+ [CONSTRUCT] + [VERBAL CLAUSE rela=RgRc] (often with infinitive but occasionally with qatal)
* \+ [PLURAL ENDING]
* \+ [PRONOMINAL SUFFIX]
* \+ [אשׁר in VERBAL CLAUSE rela=Attr]

Let's see how much of the יום pattern this accounts for. The individual cases are stored under `tag2res['יום']`. We define a few search patterns to account for the cases above. The phrases stored in `tag2res` are fed in as sets so that only those cases are queried.

In [None]:
yom_phrases = set(phrase for res in tag2res['יום'] for phrase in res)
found_yom = set()

print(len(yom_phrases), 'total יום phrases')

# + CONSTRUCT + VERBAL CLAUSE
verbal_construct = set(res[1] for res in A.search('''

sentence
    yomphrase
        word lex=JWM/
        /with/
        <mother- clause rela=RgRc kind=VC
        /or/
        y1:yomphrase
            ..
        c1:clause rela=Attr
        y1 <mother- c1
        /-/

''', sets={'yomphrase': yom_phrases}, silent=True))
found_yom |= (verbal_construct)

print(f'verbal construct cases found: {len(verbal_construct)}')


# + PLURAL
pluralday = set(res[1] for res in A.search('''

sentence
    yomphrase
        word lex=JWM/ nu=pl

''', sets={'yomphrase': yom_phrases}, silent=True))
found_yom |= (pluralday)

print(f'plural cases found: {len(pluralday)}')

# + PRONOMINAL 
pronominalday = set(res[1] for res in A.search('''

sentence
    yomphrase
        word lex=JWM/ prs#absent

''', sets={'yomphrase': yom_phrases}, silent=True))
found_yom |= (pronominalday)
print(f'pronominal suffix cases found: {len(pronominalday)}')

# + אשׁר/relative + VERBAL CLAUSE 
asher_day = set(res[1] for res in A.search('''

sentence
        yomphrase
            word lex=JWM/
            /with/
            sentence
                ..
                <: clause rela=Attr
                    =: phrase function=Rela
            /-/

''', sets={'yomphrase': yom_phrases}, silent=True))
found_yom |= (asher_day)
print(f'relative attributive cases found: {len(asher_day)}')

print(f'remaining cases: {len(yom_phrases-found_yom)}')
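The cell above follows a process of elimination: each pattern contributes a set of matching phrases, the sets are unioned, and whatever is left over is inspected by hand. A minimal sketch of that bookkeeping, with hypothetical node numbers and pattern hits rather than real query results:

```python
# Hypothetical phrase nodes and pattern matches, for illustration only
all_phrases = {101, 102, 103, 104, 105, 106}
pattern_hits = {
    'construct + VC': {101, 102},
    'plural': {102, 103},
    'pronominal suffix': {104},
}

found = set()
for name, hits in pattern_hits.items():
    newly = hits - found        # cases no earlier pattern has already claimed
    found |= hits
    print(f'{name:<20} {len(hits)} found, {len(newly)} new')

# whatever no pattern matched is left for manual inspection
remaining = all_phrases - found
print('remaining cases:', sorted(remaining))
```

Tracking the newly claimed cases alongside the raw counts makes overlap between patterns visible, which matters because a phrase can match more than one pattern (for instance a plural with a suffix).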

Let's look at the 5 remaining cases...

In [None]:
rare_jwm = [(case,) for case in yom_phrases-found_yom]
A.show(rare_jwm, extraFeatures='st nu', condenseType='sentence') # show the remaining cases

In [None]:
T.sectionFromNode(822516)

In [None]:
T.text(L.u(822516, 'verse')[0])

In [None]:
A.show(A.search('''

phrase function=Time
    <head- word lex=MN
    <obj_prep- word lex=<WLM/


'''))

In [None]:
T.sectionFromNode(722459)

In [None]:
T.text(L.u(722459, 'verse')[0])

The remaining cases are all interesting, especially Psalm 138:3 and Ruth 4:5. These may be true cases of non-modification, **construing יום as an adverb.** The case of Ezra 3:4 does not look like an adverbial use of the time construction. I will consider removing it from the samples moving forward. 

There is one important case that I did not account for at first: the dual ending. I added that below.

In [None]:
# + DUAL
dualday = set(res[1] for res in A.search('''

sentence
    yomphrase
        word lex=JWM/ nu=du

''', sets={'yomphrase': yom_phrases}, silent=True))
found_yom |= (dualday)

print(f'dual cases found: {len(dualday)}')

With this final case added to the others, the total number of construction forms comes to 5:

* \+ [CONSTRUCT] + [VERBAL CLAUSE rela=RgRc | occasionally Attr in BHSA] (often with infinitive but occasionally with qatal)
* \+ [PLURAL ENDING]
* \+ [PRONOMINAL SUFFIX]
* \+ [אשׁר in VERBAL CLAUSE rela=Attr]
* \+ [DUAL ENDING]

**Based on these features, I propose to attempt a two-way subdivision of all constructions in the `prep.time` construction: 1) those that appear with specification, and 2) those that appear without specification.** I will test the efficacy of this division below with a handcoded version of the templates from above. 

The specified times will go into `cx_specified` mapped to the form that they were found in.

NB: I have moved the plural to the bottom. The plural can often co-occur with other specifications. Yet it seems that the other specifications have the "final say" so-to-speak, in the sense that they are still able to function as they do. The effect of the plural is the same as it is elsewhere: to extend the time over a duration through quantification.

In [None]:
def tagSpecs(cx):
    
    '''
    A function that queries for 
    specifications on a time noun
    or phrase within a construction 
    marked for time function.
    
    output - tuple of (tag string, (phrase, time) result, time word node)
    '''
    
    phrase = L.d(cx, 'phrase')[0]
    time = next(role[0] for role in E.role.t(cx) if role[1]=='time')
    time_mother = [cl for cl in E.mother.t(time) if F.rela.v(cl) == 'RgRc']
    phrase_mother = [cl for cl in E.mother.t(phrase) if F.rela.v(cl) == 'Attr']
    result = (phrase, time)

    
    tag = []
    
    # isolate construct + verbal clauses
    if time_mother:
        tag.append('construct + VC')
        
    elif F.st.v(time) == 'c':
        tag.append('construct + NP?')
        
    # isolate pronominal suffixes
    if F.prs.v(time) not in {'absent', 'n/a'}:
        tag.append('pronominal suffix')
        
    # isolate relative clauses | attributives
    if phrase_mother:
        if 'Rela' in set(F.function.v(ph) for ph in L.d(phrase_mother[0], 'phrase')):
            tag.append('RELA + VC')
        else:
            tag.append('+ VC')
        
    # isolate plural endings
    if F.nu.v(time) == 'pl' and F.pdp.v(time) not in {'prde'}: # exclude plural forms inherent to the word
        tag.append('plural')
        
    # isolate dual endings
    if F.nu.v(time) == 'du':
        tag.append('dual')
        
    return ' & '.join(tag), result, time


cx_specified = collections.defaultdict(list)
lex2tag2result = collections.defaultdict(lambda: collections.defaultdict(list)) # keep a mapping from time lexemes to their specific results 

for cx in set(F.otype.s('construction')) & set(F.label.s('prep.time')):
    
    tag, result, time = tagSpecs(cx)
    
    if tag:
        cx_specified[tag].append(result)
        lex2tag2result[F.lex_utf8.v(time)][tag].append(result)
        
cx_specified_all = set(res for tag in cx_specified for res in cx_specified[tag])
        
found = len(set(res[0] for res in cx_specified_all))

print(f'number found {found} ({found / pt_cx["Total"].sum()})')

In [None]:
for tag, results in cx_specified.items():
    print('{:<30} {}'.format(tag, len(results)))

These patterns thus account for 40% of the cases in this construction. That is a good discrimination rate.
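To make the tagging logic easier to follow without the corpus loaded, here is a stripped-down sketch of the same idea. It is not the notebook's `tagSpecs`: the word records are plain dictionaries with invented values, the helper name `tag_specs` is hypothetical, and only the word-level checks (state, suffix, number) are reproduced.

```python
def tag_specs(word):
    """Return a ' & '-joined tag for a toy word record (a plain dict)."""
    tags = []
    if word.get('st') == 'c':
        tags.append('construct')
    if word.get('prs') not in {None, 'absent', 'n/a'}:
        tags.append('pronominal suffix')
    if word.get('nu') == 'pl':
        tags.append('plural')
    elif word.get('nu') == 'du':
        tags.append('dual')
    return ' & '.join(tags)

# Invented records using BHSA-style feature names; not real corpus words
toy_words = [
    {'lex': 'JWM/', 'nu': 'pl', 'prs': 'W', 'st': 'a'},
    {'lex': 'JWM/', 'nu': 'sg', 'prs': 'absent', 'st': 'c'},
    {'lex': '<WLM/', 'nu': 'sg', 'prs': 'absent', 'st': 'a'},
]

tags = [tag_specs(w) for w in toy_words]
specified = [t for t in tags if t]      # an empty tag means "no specification"
rate = len(specified) / len(tags)
print(tags, f'{rate:.0%} specified')
```

An empty tag string is what separates the unspecified group from the specified one, so the discrimination rate is simply the share of non-empty tags.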

In [None]:
#A.show(cx_specified['construct + NP?'], condenseType='sentence', extraFeatures='st')

Let's see what lexemes those accounted for...

In [None]:
lex_count = collections.Counter()

for phrase, time in cx_specified_all:
    lex_count[F.lex_utf8.v(time)] += 1
    
lex_count = convert2pandas(lex_count)

lex_count

In [None]:
lex2tag2result['פנה']

In [None]:
#A.show(lex2tag2result['יום']['pronominal suffix'])

Many of these lexemes are quite similar to יום in terms of being calendrical or having similar prepositional preferences.

Below are lexemes that were not found.

In [None]:
not_specified = []
nsresults = collections.defaultdict(list)

spec_set = set(res[0] for res in cx_specified_all)

for cx in set(F.otype.s('construction')) & set(F.label.s('prep.time')):
    
    phrase = L.d(cx, 'phrase')[0]
    
    if phrase not in spec_set:
        time = next(role[0] for role in E.role.t(cx) if role[1]=='time')
        result = (phrase, time)
        not_specified.append(result)
        nsresults[F.lex_utf8.v(time)].append(result)
        
print(len(not_specified))

In [None]:
lex_count2 = collections.Counter()

for phrase, time in not_specified:
    lex_count2[F.lex_utf8.v(time)] += 1
    
lex_count2 = convert2pandas(lex_count2)

lex_count2

This is a strong list of adverbial forms. There are also several nouns mixed in. Note the appearance of יומם as well, which is a great example of a nominal form that is slotted as an adverbial—in this case that is obvious because of the adverbial ending that is appended to it.

**The cases above, as they are not modified, are anchored either to discourse context or the time of speech.** Others, such as בטן, have rather inferred anchor points. It would be interesting to isolate when the reference is discourse-anchored. For example, in the case of proper names this would be relatively easy to ascertain. However, most of these seem to be anchored to speech time.

In [None]:
A.show(tag2res['זאת'], extraFeatures='prs')

This case ^ is a good example, though, of a discourse-anchored form.

Below, I randomize the not specified list and manually inspect many examples to make sure there are no specifications I've missed.

In [None]:
random.shuffle(not_specified)

In [None]:
#A.show(not_specified, condensed=False, condenseType='sentence')

## The Role of Specifications

After reflecting on the specifications that I've isolated in the `prep.time` group, I am wondering what they have in common and where they differ. The + verbal clause specifications and pronominal suffixes anchor the time references to specific participants or events in the discourse. The pronominal suffixes also have a commonality with the verbal clauses since both contain markers of person, often identically so as the infinitive accepts the pronominal suffix. The other two specifications, that of the plural and the dual, then seemingly have a quite different role to play. They do not anchor the time, but they modify it by extending it in quantity, which metaphorically indicates a duration of time. In the case of the dual this duration is specified. **Furthermore, the plural differs from the other specifications in that it is compatible with them—in all other cases the specifications are mutually exclusive.**
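The claim about mutual exclusivity can be checked mechanically from the tag strings, since co-occurring specifications are joined with `' & '`. A minimal sketch with invented tags (the real input would be the keys and counts of `cx_specified`):

```python
import collections
from itertools import combinations

# Toy tags in the ' & '-joined format; the list is invented for illustration
toy_tags = ['plural', 'pronominal suffix & plural', 'construct + VC',
            'RELA + VC & plural', 'dual', 'pronominal suffix']

pair_counts = collections.Counter()
for tag in toy_tags:
    specs = sorted(tag.split(' & '))
    for pair in combinations(specs, 2):
        pair_counts[pair] += 1

# any co-occurring pair without 'plural' would contradict the exclusivity claim
violations = {pair: n for pair, n in pair_counts.items() if 'plural' not in pair}

print(dict(pair_counts))
print('violations:', violations)
```

If `violations` comes back non-empty on the real data, those pairs are the cases worth inspecting by hand.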

The time of יום accepts all of the major roles, revealing its multi-purpose utility as the generic time marker, as seen below:

In [None]:
for spec, results in lex2tag2result['יום'].items():
    print('{:<30} {}'.format(spec, len(results)))

עת, as a seeming near synonym of יום, also accepts a variety of specifications:

In [None]:
for spec, results in lex2tag2result['עת'].items():
    print('{:<30} {}'.format(spec, len(results)))

The next most common term, פנה, appears in all cases in the plural, but in 6 cases with an additional specification of the suffix.

In [None]:
for spec, results in lex2tag2result['פנה'].items():
    print('{:<30} {}'.format(spec, len(results)))

מות only occurs with the pronominal suffix though:

In [None]:
for spec, results in lex2tag2result['מות'].items():
    print('{:<20} {}'.format(spec, len(results)))

After more data has been gathered, it would be a good idea to see whether there are any statistical associations between certain terms and specification. For instance, it is clear that עולם has a strong association with non-specification. Then some terms are used both with and without it.
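One way such an association could be tested is a one-sided Fisher's exact test on a 2x2 table of lexeme (target vs. all others) against specification (unspecified vs. specified). The sketch below is a self-contained version using only the standard library. It is independent of the notebook's own `apply_fishers` helper, whose internals are not shown here, and the counts are invented.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]]:
    the probability of a or more in the top-left cell, given fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    return sum(comb(row1, k) * comb(row2, col1 - k)
               for k in range(a, upper + 1)) / total

# Invented counts: rows = target lexeme / other lexemes,
# columns = unspecified / specified
p_toy = fisher_greater(8, 2, 1, 5)
print(f'p = {p_toy:.4f}')
```

A small p-value would suggest the target lexeme attracts the unspecified pattern more than chance would predict. With many lexemes tested at once, some correction for multiple comparisons would also be needed.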

**Looking through the list of other major clusters besides `prep.time`, it seems that the other clusters are likewise defined by different methods of specification:** 

In [None]:
freq_times.head(10)

**I propose the possibility that specification is the means by which a time is anchored to discourse.** That is self-evident in the prominence of the form `prep.H.time.H.dem`, i.e. the demonstrative plays a front-and-center role, anchoring the time noun to a point forward or backward relative to the discourse. The same is evident with the cluster `prep.H.time.H.ordn`, with the ordinal number anchoring the time to a day or month on the calendar.

Other clusters potentially resemble the unspecified times found above, such as `quantNP`, a quantified noun phrase. These times technically have the specification of quantification, but it is likely, as we have seen with the plural, that this form can variously be combined with or without additional specifications.

## Co-specifications with `prep.H.time.H.dem`?

This is the next most frequent cluster in the set. With the demonstrative already in place, it seems likely that this construction deflects additional specifications. Let's write a query to see if this is so. The query will use parameters similar to those used above to separate specified from unspecified cases.

In [None]:
prephdem_specs = collections.Counter()
phdtag2res = collections.defaultdict(list)

for cx in set(F.otype.s('construction')) & set(F.label.s('prep.H.time.H.dem')):
    tag, result, time = tagSpecs(cx)
    tag = tag or 'no further specification'
    prephdem_specs[tag] += 1
    phdtag2res[tag].append(result)
    
convert2pandas(prephdem_specs)

We have 4 cases of additional specification. Let's look closer...

In [None]:
A.show(phdtag2res['RELA + VC'], condenseType='sentence')

In [None]:
A.show(phdtag2res['RELA + VC & plural'], condenseType='sentence')

These specifications *may* reveal a difference between the attributive specification characterized by relative particles and the specification characterized by the construct. The attributive spec can describe an anchored time. But the construct spec, if it plays an anchoring role itself, may resist being combined with additional anchors.

The majority of cases, though, seem to disprefer specification. I will examine some random selections from the `no further specification` set to make sure.

In [None]:
random.shuffle(phdtag2res['no further specification'])

In [None]:
#A.show(phdtag2res['no further specification'], condensed=False, condenseType='sentence')

## Specifications with `quantNP`

I will apply the same query method with `quantNP`, to see how much of the data is accounted for and whether any new specifications are missed. The function has to be modified a bit to interact with the `quantNP` chunk.

In [None]:
def tagSpecsQuant(cx):
    
    '''
    A function that queries for 
    specifications on a time noun
    or phrase within a construction 
    marked for time function.
    
    output - tuple of (tag string, [phrase, time nouns...] result list)
    
    Note on Quantifier Constructions:
    The quantNP can be a complex construction.
    It is built of smaller quantNP chunks, 
    perhaps a single chunk or perhaps more.
    The "quantified" edge value identifies a word as the 
    time noun being quantified. But this is only stored on
    the lowest level chunks. A few extra steps are needed
    to isolate these nouns and check them for specifications.
    '''
    
    phrase = L.d(cx, 'phrase')[0]
    phrase_mother = [cl for cl in E.mother.t(phrase) if F.rela.v(cl) == 'Attr'] # look for attr rela on phrase
    
    # isolate component quantNP chunks
    atomic_chunks = [chunk for chunk in L.d(cx, 'chunk') 
                         if L.u(chunk, 'chunk') # either is not top level chunk
                         or len(L.d(cx, 'chunk')) == 1 # or has no embedded chunks
                    ]
    # get list of quantified time noun(s)
    times = [noun[0] for chunk in atomic_chunks for noun in E.role.t(chunk) if noun[1] == 'quantified']    
    time_mothers = [cl for time in times for cl in E.mother.t(time) if F.rela.v(cl) == 'RgRc']
    
    result = [phrase] + times
    tag = []
    
    # isolate construct + verbal clauses
    if time_mothers:
        tag.append('construct + VC')
    
    elif set(t for t in times if F.st.v(t) == 'c'):
        tag.append('construct + ??')
        
    # isolate pronominal suffixes
    if set(t for t in times if F.prs.v(t) not in {'absent', 'n/a'}):
        tag.append('pronominal suffix')
        
    # isolate relative clauses | attributives
    if phrase_mother:
        if 'Rela' in set(F.function.v(ph) for ph in L.d(phrase_mother[0], 'phrase')):
            tag.append('RELA + VC')
        else:
            tag.append('+ VC')
        
    # isolate plural endings
    if set(t for t in times if F.nu.v(t) == 'pl' and F.pdp.v(t) not in {'prde'}): # exclude plural forms inherent to the word
        tag.append('plural')
        
    # isolate dual endings
    if set(t for t in times if F.nu.v(t) == 'du'):
        tag.append('dual')
        
    return ' & '.join(tag), result

In [None]:
quantnp_specs = collections.Counter()
qnptag2res = collections.defaultdict(list)


for cx in set(F.otype.s('construction')) & set(F.label.s('quantNP')):
    tag, result = tagSpecsQuant(cx)
    tag = tag or 'no known spec'
    quantnp_specs[tag] += 1
    qnptag2res[tag].append(result)
    
convert2pandas(quantnp_specs)

The plural specs are expected with the quantifier NP. As above, the relative attributive spec has appeared. But the quantified NP has resisted any construct relations.

In [None]:
A.show(qnptag2res['RELA + VC & plural'], condenseType='sentence')

I will inspect randomized cases of `no known spec` below.

In [None]:
random.shuffle(qnptag2res['no known spec'])

In [None]:
#A.show(qnptag2res['no known spec'], condensed=False, condenseType='sentence')

## TODO: MERGE SEVERAL OF THESE KINDS OF PHRASES

In [None]:
# A.show(A.search('''

# phrase function=Time
# /with/
# clause
#     ..
#     <: phrase function=Modi
# /or/
# clause
#     phrase function=Modi
#     <: ..
# /-/

# '''), condenseType='sentence')

## `prep.H.time`

In [None]:
prephtime = collections.Counter()
phttag2res = collections.defaultdict(list)


for cx in set(F.otype.s('construction')) & set(F.label.s('prep.H.time')):
    tag, result, time = tagSpecs(cx)
    tag = tag or 'no known spec'
    prephtime[tag] += 1
    phttag2res[tag].append(result)
    
convert2pandas(prephtime)

In [None]:
freq_times.loc['prep.time.adju']

In [None]:
#freq_times.head(50)

In [None]:
# A.show(A.search('''

# phrase function=Time
#     word lex=JWM/ st=c
#     <: word pdp=subs

# '''))

# Feature-Based Clustering in a Complex Constructional Network

After the analysis thus far, I believe I have gathered a list of features which are fairly efficacious at separating time constructions:

* PP | NP
* ה time
* H time H ___
    * \+ demonstrative
    * \+ ordinal
    * \+ attributive
* construct
    * construct + VP
    * construct + NP
* attributive (+אשר)
* pronominal suffix
* plural via du | pl endings
* quantification via quantNP

It is important that several of these features are "stackable"—meaning that they can be combined in different ways. The ultimate goal is to achieve a taxonomy of time constructions at the phrasal level. So how should we think about these various pieces and their inter-relatability?

One way to represent these structures would be with tree-like inheritance, so that a timePP inherits its NP patterns from timeNP. The limitations of this approach can be seen in the following construction:

* ב.ה.יום.ה.זה (e.g. Ex 19:1, Lev 8:34; 88x total) 
* ה.יום.ה.זה (e.g. Deut 2:25, 5:24; 29x total)

In [None]:
# examples of ביום הזה and היום הזה isolated below...

bhjwm = [L.d(cx,'phrase') for cx in F.label.s('prep.H.time.H.dem')
           if not {'JWM/', 'ZH'} - set(F.lex.v(w) for w in L.d(cx, 'word'))]

hjwm = [L.d(cx,'phrase') for cx in F.label.s('H.time.H.dem')
           if not {'JWM/', 'ZH'} - set(F.lex.v(w) for w in L.d(cx, 'word'))]

print(f'{len(bhjwm)} of ב.ה.יום.ה.זה found in dataset...')
print(f'{len(hjwm)} of ה.יום.ה.זה found in dataset...')

In a tree-based taxonomy, these two constructions would be separated into two different groups: PP phrases and NP phrases. This is problematic, because the taxonomy then misses the relatedness of the two phrases. Furthermore, the close relation could indeed be crucial to understanding the semantics of the NP version of this phrase: because ב.ה.יום.ה.זה is more common, hence more entrenched, it should inform how we read ה.יום.ה.זה. Indeed, this phrase seems to have very similar semantics to ב.ה.יום.ה.זה. In fact, the NP ה.יום.ה.זה would seem to have even less in common with other bare NPs, which typically do not indicate a point in time but a duration in time, e.g. שׁבע שׁנים "for seven years."

A better representation is a graph network, wherein constructions are represented as nodes and inheritances between them as edges. This format allows for multiple inheritance paths, and for numerous features to be modeled and compared at once. Furthermore, by utilizing the notion of a graph, it is possible to cluster based on a set-like comparison between features of constructions. For instance, in the case above, three similarities are registered: the presence of the *heh* definite article, the `H.time.H.modifier` construction, and the presence of the demonstrative. Thus, accounting for the difference of a preposition, we could place a value on this similarity, for instance, by saying they are 3/4 or 75% similar (the Jaccard similarity measure: shared features divided by all features found in either construction). These similarity values give the raw material needed to build clusters and taxonomic relations in a complex network. A neighbor-embedding method (e.g. t-SNE) can reveal neighborhoods of similar constructions, which helps to identify distinct clusters within the constructions. A method such as PCA can be utilized to find major divisions and distinguish which features create the most separation in the dataset. The feature sets also allow us to easily select constructions based on the presence of a given feature. And we can look for associations and restrictions between particular features.
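The 3/4 figure can be reproduced directly with Python sets. In this minimal sketch the feature names are illustrative labels chosen for the example, not the exact keys used later in the notebook:

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Illustrative feature sets for the two constructions discussed above
pp_version = {'prep', 'H', 'H.time.H', 'dem'}    # ב.ה.יום.ה.זה
np_version = {'H', 'H.time.H', 'dem'}            # ה.יום.ה.זה

similarity = jaccard(pp_version, np_version)
print(similarity)  # 0.75
```

Computing this for every pair of constructions would yield a weighted edge list for the graph described above, with 1 minus the similarity available as a distance for clustering.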

## Isolating a Test Set

I make one exclusion:

* exclude complex PP and NP constructions, i.e. those with coordination

Except for: 

* coordination in quantNP constructions is allowed

In [None]:
tagcount = collections.Counter()
tag2res = collections.defaultdict(list)
testset = set()

test = []

for cx in F.otype.s('construction'):
    
    cx_words = L.d(cx, 'word')
    phrase = L.d(cx, 'phrase')[0]
    ph_words = L.d(phrase, 'word')
    
    # either there is no conj or quantNP
    is_quant = 'quantNP' in F.label.v(cx)
    conj_check = ('conj' not in set(F.pdp.v(w) for w in cx_words)) or is_quant
    # either there is only 1 time or is quantNP
    ntime_check = (len([w for w in E.role.t(cx) if w[1] == 'time']) < 2) or is_quant
    
    singlephrase = len(L.d(cx, 'phrase')) == 1
    pp_notin_np = not (F.typ.v(phrase) == 'NP' and 'prep' in set(F.pdp.v(w) for w in ph_words))
    nprep_check = [F.pdp.v(w) for w in cx_words].count('prep') < 2
    
    is_match = all([cx_words, conj_check, ntime_check, 
                    nprep_check, pp_notin_np, singlephrase])
    
    if is_match:
        tagcount[F.label.v(cx)] += 1
        tag2res[F.label.v(cx)].append(L.d(cx, 'phrase'))
        testset.add(cx)
        
tagcount = convert2pandas(tagcount)

print(tagcount.shape)

print('total time constructions: ', freq_times['Total'].sum())
print('accounted for: ',  tagcount['Total'].sum(),' or ', tagcount['Total'].sum() / freq_times['Total'].sum())

tagcount.head(20)

In [None]:
#A.show(tag2res['time.adju'], extraFeatures='sem_set')

In [None]:
# for tag in tagcount.index:
#     print(tag)
#     A.show(tag2res[tag][:1], extraFeatures='st')

In [None]:
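def getIndex(thislist, index):
    '''
    A safe way to get index from 
    a list/tuple. Returns None if the index
    is out of range. Negative indices also
    return None, since they would otherwise
    wrap around to the end of the list.
    '''
    if index < 0:
        return None
    try:
        return thislist[index]
    except IndexError:
        return None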

def getQuantTimes(cx):
    '''
    Extracts times from a quant chunk
    '''
    # isolate component quantNP chunks
    atomic_chunks = [chunk for chunk in L.d(cx, 'chunk') 
                         if L.u(chunk, 'chunk') # either is not top level chunk
                         or len(L.d(cx, 'chunk')) == 1 # or has no embedded chunks
                    ]
    # get list of quantified time noun(s)
    times = [noun[0] for chunk in atomic_chunks for noun in E.role.t(chunk) if noun[1] == 'quantified']
    return times
 
def isQualQuant(word):
    '''
    Checks whether a word is a qualitative
    quantifier, i.e. a quantifier that is
    not a cardinal number.
    '''
    return F.sem_set.v(word) == 'quant' and F.ls.v(word) != 'card'
    
def tagConstructionSpecs(cx):
    '''
    A function that tags time constructions
    with specifications found around their 
    time nouns. Returns a dictionary with
    spec strings as keys and 1 as values,
    wherein 1 simply means present.
    '''
    
    phrase = L.d(cx, 'phrase')[0]
    ph_words = L.d(phrase, 'word')
    sent_words = L.d(L.u(phrase, 'sentence')[0], 'word')
    dep_cl = next((cl for cl in E.mother.t(phrase) if F.rela.v(cl) == 'Attr'), None)
    times = [time[0] for time in E.role.t(cx) if time[1] == 'time'] or getQuantTimes(cx) or E.nhead.t(phrase)
    features = {}
    
    # phrase type, PP or NP, wherein AdvP are considered a kind of NP
    typ = 'PPtime' if F.typ.v(phrase) == 'PP' else 'time'
    features[typ] = 1
    
    for time in times:
        
        # get relative slot positions
        timei = ph_words.index(time)
        m1 = getIndex(ph_words, timei-1) # minus 1, etc.
        m2 = getIndex(ph_words, timei-2)
        p1 = getIndex(ph_words, timei+1) # plus 1, etc.
        p2 = getIndex(ph_words, timei+2)
        # relative slots in sentence
        timei_s = sent_words.index(time)
        s1 = getIndex(sent_words, timei_s+1)
        
        # preceding article
        if F.lex.v(m1) == 'H':
            features['H'] = 1
        
        # plurals
        if F.nu.v(time) == 'pl' and F.pdp.v(time) != 'prde':
            features['pl'] = 1
            
        elif F.nu.v(time) == 'du':
            features['quant'] = 1
            features['du'] = 1
        
        # pronom suffixs
        if F.prs.v(time) not in {'absent', 'n/a'}:
            features['sffx'] = 1
        
        # check quant & qual quants
        is_quant = set(ch for ch in L.u(time, 'chunk') 
                          if F.label.v(ch) and 'quant' in F.label.v(ch))
        if is_quant:
            features['quant'] = 1
            features['card'] = 1
            
        is_qualq = any([isQualQuant(m1),
                        isQualQuant(m2) and F.lex.v(m1) == 'H',
                        F.lex.v(p1) == 'H' and isQualQuant(p2)])
        if is_qualq:
            features['quant'] = 1
            features['qual'] = 1
        
        # constructs
        if F.st.v(time) == 'c':
            next_word = p1 if F.pdp.v(p1) != 'art' else p2
            next_verb = s1
            features['construct'] = 1
            if F.pdp.v(next_verb) == 'verb':   
                features['cons+VC'] = 1
            elif F.pdp.v(next_word):
                features['cons+NP'] = 1
                
        # h.time.h.spec pattern
        if F.lex.v(m1) == 'H' and F.lex.v(p1) == 'H':
            features['attr_patt'] = 1
            
        # demonstrative / ordinal / qualquant / spec
        is_dem = any([F.pdp.v(p1) == 'prde',
                      F.lex.v(p1) == 'H' and F.pdp.v(p2) == 'prde'])
        is_ordn = F.lex.v(p1) == 'H' and F.ls.v(p2) == 'ordn'
        
        is_spec = all([F.lex.v(m1) == 'H', 
                       F.lex.v(p1) == 'H',
                       not (F.lex.v(p1) == 'H' and isQualQuant(p2)),
                       not is_dem, not is_ordn, not is_qualq])
        if is_dem:
            features['demon'] = 1
        elif is_ordn:
            features['ord'] = 1
        elif is_spec:
            features['attrb'] = 1
        
        # attributives
        small_sp = next(iter(sorted(L.u(time, 'subphrase'))), 0)
        attr_relas = set(rel_sp for rel_sp in E.mother.t(small_sp) if F.rela.v(rel_sp) == 'atr')
        
        if attr_relas and not {'demon', 'ord', 'attrb', 'qual'} & set(features.keys()):
            features['adjv'] = 1    
        
    # tag relative/attributive specs dependent on phrase
    if dep_cl:
        rel = 'rela' if 'Rela' in set(F.function.v(ph) for ph in L.d(dep_cl, 'phrase')) else ''
        clkind = F.kind.v(dep_cl)        
        features[f'{rel}+{clkind}'] = 1        
    
    tag = '.'.join(features.keys())
    result = (cx,) + L.d(cx, 'phrase') + tuple(times)
    
    return tag, result, features

Run the search...

In [None]:
tag2result = collections.defaultdict(list)
spec2result = collections.defaultdict(list)
cx2tag = {}
cx2preptags = {}
specdata = {}

for cx in testset:
    tag, result, features = tagConstructionSpecs(cx)
    specdata[result[0]] = features # store on first cx node
    tag2result[tag].append(result)
    cx2tag[cx] = tag
    
    # build tags with prepositions
    phrase = L.d(cx, 'phrase')[0]
    if F.typ.v(phrase) == 'PP':
        prep = next(ch for ch in L.d(phrase, 'chunk') if F.label.v(ch) == 'prep')
        prep_txt = ''.join(F.lex.v(w) for w in L.d(prep, 'word'))
        prep_tag = tag.replace('PPtime', prep_txt+'+time')
    else:
        prep_tag = 'ø+'+tag
    cx2preptags[cx] = prep_tag
    
    for spec in features:    
        spec2result[spec].append(result)
    
specdata = pd.DataFrame(specdata).fillna(0)
    
print(specdata.shape[1], 'results logged...') # constructions are stored as columns
print(len(tag2result.keys()), 'separate tags logged...')

In [None]:
print(specdata.shape)
specdata.head()
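The orientation of `specdata` is easy to get backwards, so here is a small sketch of how a dictionary of feature dictionaries becomes a matrix in pandas. The construction node numbers and feature values are invented; only the shape behaviour is the point.

```python
import pandas as pd

# Hypothetical construction nodes mapped to their feature dictionaries
toy_specdata = {
    1001: {'PPtime': 1, 'H': 1},
    1002: {'time': 1, 'pl': 1},
}

# outer keys (constructions) become columns, inner keys (features) become rows
toy = pd.DataFrame(toy_specdata).fillna(0)

print(toy.shape)     # (number of features, number of constructions)
print(toy)
```

This features-by-samples layout matches what `skfuzzy.cluster.cmeans` expects as input, so `specdata.values` can be passed without transposing. Most scikit-learn estimators expect the opposite layout (samples as rows), in which case `specdata.T` would be needed.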

In [None]:
#A.show(tag2result['NP'], condenseType='sentence')

## Clustering with Fuzzy C-Means

C-means is a fuzzy clustering method which allows us to model both strong tendencies and ambiguity in the data. As with K-means, C-means requires a certain number of clusters to be predetermined. In order to find the ideal number of clusters, we can iterate from 2 to N and measure the fuzzy partition coefficient for each iteration. The coefficient tells how cleanly the data are partitioned: values closer to 1 indicate crisper, less ambiguous cluster memberships.
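For intuition about what the coefficient measures, the sketch below computes it by hand from a membership matrix, using the standard definition (the mean over samples of the summed squared memberships). The matrices are toy values; in the notebook the coefficient comes back from `cmeans` as `fpc`.

```python
def partition_coefficient(u):
    """Fuzzy partition coefficient of a membership matrix u, where
    u[k][i] is the membership of sample i in cluster k and each
    column sums to 1."""
    n_samples = len(u[0])
    return sum(m * m for row in u for m in row) / n_samples

# Toy membership matrices: 2 clusters x 3 samples
crisp = [[1.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]      # every sample belongs fully to one cluster
vague = [[0.5, 0.5, 0.5],
         [0.5, 0.5, 0.5]]      # every sample is split evenly

print(partition_coefficient(crisp))   # 1.0
print(partition_coefficient(vague))   # 0.5
```

The lower bound is 1 divided by the number of clusters, reached when every sample is spread evenly, so scores obtained with different cluster counts are not on an identical scale.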

In [None]:
specdata.values.shape

In [None]:
part_coefficients = [] # partition coefficients
nclusters = [] # number of clusters

# measure coefficients for n-clusters 2 to 30
for i in range(2, 31): 
    cntr, u, u0, d, jm, p, fpc = cmeans(specdata.values, i, 2, error=0.005, maxiter=1000, seed=13)
    part_coefficients.append(fpc)
    nclusters.append(i)

Visualize the coefficient scores for each n-cluster. This helps us to see what the ideal number of clusters should be. We have to balance between "lumping and splitting" (as Croft calls it). With more clusters, we will inevitably have more consistency but less usefulness. Thus we need to find the number of clusters that gives the greatest consistency with the fewest clusters.
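
The trade-off can also be written down as an explicit rule. The sketch below is only one hypothetical heuristic, not the procedure used in this notebook (the cluster count here is read off the plot): it picks the smallest number of clusters whose coefficient is within a chosen tolerance of the best score. The function name, the tolerance, and the toy scores are all invented.

```python
def smallest_good_n(nclusters, coefficients, tolerance=0.05):
    """Return the smallest cluster count scoring within `tolerance` of the best coefficient."""
    best = max(coefficients)
    for n, coef in sorted(zip(nclusters, coefficients)):
        if best - coef <= tolerance:
            return n

# invented scores that level off after 4 clusters
toy_n = [2, 3, 4, 5, 6]
toy_fpc = [0.40, 0.55, 0.71, 0.73, 0.74]

chosen = smallest_good_n(toy_n, toy_fpc)
print(chosen)
```

A wider tolerance lumps more aggressively; a narrower one favors splitting.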

In [None]:
plt.figure(figsize=(20, 6))
plt.plot(nclusters, part_coefficients)
plt.xticks(nclusters)
plt.axvline(18, color='red', linestyle='--')
plt.xlabel('Number of Clusters')
plt.ylabel('Fuzzy Partition Coefficient')
plt.savefig(firstyear+'cmeans_clustering.png', dpi=300, bbox_inches='tight')
plt.title(f'C-means Clustering Coefficients with Number of Clusters, from {nclusters[0]} to {nclusters[-1]}')
plt.show()

### Implement the Clusters

In [None]:
number_clusters = 18

cntr, u, u0, d, jm, p, fpc = cmeans(specdata.values, number_clusters, 2, error=0.005, maxiter=1000, seed=13)

In [None]:
u.shape

### Visualize Membership Coefficients within Clusters

Since these are fuzzy clusters, all clusters contain **all** the constructions. Each construction has a corresponding score, which tells how close it is to the mean within the cluster. This score helps us to visualize membership ambiguities.

In [None]:
examplematrix = []

for i, cluster in enumerate(u):
    clustdata = pd.DataFrame(cluster, index=specdata.columns).sort_values(by=0, ascending=False)
    egcx = int(clustdata.index[0])
    eg = surfaceToken(egcx)
    size = clustdata[clustdata[0] > 0.9].shape[0]
    
    examplematrix.append([cx2tag[egcx], eg, size])
    
    plt.figure(figsize=(12, 4))
    showdata = clustdata.values[:500]
    plt.plot(np.arange(showdata.shape[0]), showdata)
    plt.title(f'cluster {i}, {cx2tag[egcx]}, e.g. {reverse_hb(eg)}', size=16)
    plt.ylabel('Membership Coefficient', size=14)
    plt.xlabel('Constructions 1–N', size=14)
    plt.xticks(size=12)
    plt.yticks(size=12)
    plt.show()

In [None]:
cluster_examples = pd.DataFrame(examplematrix, columns=['Cluster Name', 'Example', 'Size']).set_index('Cluster Name')
cluster_examples = cluster_examples.sort_values(by='Size', ascending=False)
cluster_examples

In [None]:
cluster_examples['Size'].sum()

In [None]:
cluster_examples.to_excel(firstyear+'clusters.xlsx')

### Count Good Fits and Find Misfits

Which constructions do not find themselves in an ideal cluster? First find the number of strong fits. All of these clusters have top scores far above the others, so the cutoff point is essentially arbitrary: anything above ~0.2 would separate them, and the code below uses a stricter threshold of 0.9. We also store all of the cluster mappings.
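
As a small illustration of the fit/misfit split, the sketch below applies a 0.9 cutoff (the value used in the `good_fits` filter) to a toy membership matrix. The item names and scores are invented; in the real data the rows come from the `u` matrix returned by `cmeans`.

```python
THRESHOLD = 0.9  # membership score an item must exceed to count as a strong fit

# toy membership matrix: one row per cluster, one column per construction
items = ['cx1', 'cx2', 'cx3', 'cx4']
u_toy = [[0.95, 0.10, 0.50, 0.02],
         [0.05, 0.90, 0.50, 0.98]]

strong = set()
item2cluster = {}
for clust, row in enumerate(u_toy):
    for item, score in zip(items, row):
        if score > THRESHOLD:
            strong.add(item)
            item2cluster[item] = clust

# anything without a strong fit in any cluster is a misfit
misfit_items = set(items) - strong
print(sorted(strong), sorted(misfit_items))
```

Note that `cx2` scores exactly 0.90 and is therefore a misfit under a strict `>` comparison, which is worth keeping in mind when reading the counts.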

In [None]:
strong_fits = set()
clust2cx = {}
clust2mainlabel = {}
cx2clust = {}
clustermatrix = []

clustmainlabel2cx = collections.defaultdict(list)


for i, cluster in enumerate(u):
    clustdata = pd.DataFrame(cluster, index=specdata.columns)
    clustermatrix.append(clustdata[0].values)
    good_fits = clustdata[clustdata[0] > 0.9]
    mainlabel = cx2tag[clustdata.sort_values(ascending=False, by=0).index[0]]
    clust2mainlabel[i] = mainlabel
    
    for cx in good_fits.index:
        cx2clust[cx] = i
        strong_fits.add(cx)
        clustmainlabel2cx[mainlabel].append(cx)
    clust2cx[i] = set(good_fits.index)
    
    
label2clust = dict((label, clust) for clust, label in clust2mainlabel.items())
clustermatrix = pd.DataFrame(np.array(clustermatrix).T, columns=np.arange(number_clusters), index=specdata.columns)

print('number of time constructions', freq_times['Total'].sum())
print('size of testset: ', len(testset))
print('number of strong fits:', len(strong_fits), '({}'.format(len(strong_fits) / freq_times['Total'].sum()), 'of all time constructions)')

#### Misfits

In [None]:
misfits = testset - strong_fits

print('number of misfits: ', len(misfits))

In [None]:
# for i, mf in enumerate(list(misfits)[:100]):
    
#     closest = clustermatrix.loc[mf].sort_values(ascending=False)
#     clust, score = closest.index[0], closest.values[0]

#     print(f'closest to: {clust2mainlabel[clust]} ({clust}) with score of {score}')
    
#     print(cx2tag[mf])
#     A.prettyTuple(L.d(mf, 'phrase'), seq=i, condensed=False, extraFeatures='pdp')

This fact is readily explained in the constructional framework: a prototypical adverb is placed into a noun construction and construed as such.

## Gather Paper Data

In [None]:
len(tag2result)

In [None]:
total_cx = len(list(F.otype.s('construction')))

##### Number of +VC

In [None]:
# number of +VC specs

cons_VC = spec2result['cons+VC']
rela_VC = spec2result['rela+VC']
VC = spec2result['+VC']

tota_vc = len(cons_VC) + len(rela_VC) + len(VC)

tota_vc

In [None]:
tota_vc / len(list(F.otype.s('construction')))

In [None]:
#A.show(spec2result['+VC'])

In [None]:
formatPassages(cons_VC+rela_VC+VC)

##### Cases of Bare Plurals

In [None]:
sorted(tag2result['time.pl'])

In [None]:
len(tag2result['time.pl'])

In [None]:
T.sectionFromNode(1774787)

In [None]:
T.text(L.u(1774787,'verse')[0])

In [None]:
formatPassages(tag2result['time.pl'])

##### Time + cnstr + NP

In [None]:
len(spec2result['cons+NP'])

In [None]:
len(spec2result['cons+NP']) / len(list(F.otype.s('construction')))

In [None]:
sorted(spec2result['cons+NP'])[:10]

In [None]:
T.sectionFromNode(1774647)

In [None]:
T.text(L.u(1774647, 'verse')[0])

##### Adverbs

In [None]:
len(tag2result['time'])

In [None]:
len(tag2result['PPtime'])

In [None]:
len(tag2result['time']) + len(tag2result['PPtime'])

In [None]:
(len(tag2result['time']) + len(tag2result['PPtime'])) / len(list(F.otype.s('construction')))

In [None]:
T.sectionFromNode(672200)

In [None]:
T.text(L.u(672200, 'verse')[0])

##### Suffix

In [None]:
random.shuffle(spec2result['sffx'])

In [None]:
#A.show(spec2result['sffx'], condensed=False)

In [None]:
T.sectionFromNode(800626)

In [None]:
T.text(L.u(800626, 'verse')[0])

In [None]:
len(spec2result['sffx']) / total_cx

In [None]:
formatPassages(spec2result['sffx'])

In [None]:
#A.show(tag2result['PPtime.pl.sffx'])

In [None]:
T.sectionFromNode(654163)

In [None]:
T.text(L.u(654163, 'verse')[0])

##### Demonstrative

In [None]:
len(spec2result['demon'])

In [None]:
len(spec2result['demon']) / total_cx

##### Definite Article Standalone

In [None]:
len(tag2result['time.H'])

In [None]:
len(tag2result['time.H']) / total_cx

##### Attributive Construction

In [None]:
len(spec2result['attr_patt'])

In [None]:
len(spec2result['attr_patt']) / total_cx

##### Attributed Example

In [None]:
A.show(tag2result['PPtime.attr_patt'])

In [None]:
[t for t in tag2result.keys() if 'attr_patt' in t]

In [None]:
A.show(tag2result['time.H.pl.attr_patt.adjv'])

In [None]:
T.sectionFromNode(870275)

In [None]:
T.text(L.u(870275, 'verse')[0])

In [None]:
A.show(tag2result['time.H.attr_patt.adjv'])

##### Ordinals

In [None]:
len(spec2result['ord'])

In [None]:
len(spec2result['ord']) / total_cx

### How many cases of demonstrative ה are found in discourse?

In [None]:
textype = collections.Counter()

for res in tag2result['PPtime.H']+tag2result['time.H']:
    cx = res[0]
    clause = L.u(cx, 'clause')[0]
    
    txt = F.txt.v(clause)
    txt = 'S' if {'Q', 'D'} & set(txt) else txt
        
    textype[txt] += 1
    
textype

In [None]:
d = 380
n = 95 + 15 + 4 + 3

d / (d+n)

## Associations Between Prepositions and Specifications

I have a hypothesis that the ל preposition may be attracted to plural endings, and that the concept of duration or distance may be crucial for understanding the difference between ל and a marker such as ב, which tends to indicate points in time rather than spans. It is difficult to know whether there will be any statistically significant attractions, given that ל can occur with durative terms that do not need the plural to become a duration (especially prototypical adverbs such as עולם). Thus I may try this analysis in a couple of steps. The first will look at all of the data, adverbial words included. Then I want to see if any associations are brought out by looking only at terms which *regularly* accept nominal endings. This can be a bit tricky, since even עולם *can* take nominal endings, as we saw in the tagging study of `prep.time` patterns above: 

In [None]:
A.show(lex2tag2result['עולם']['plural'])

### Preposition/øPreposition Associations Between Specifications

We can utilize the data processed in `specdata`, which contains both construction node IDs and the tagged features. 

This data is used to construct a co-occurrence matrix of feature x preposition. Non-prepositional phrases are marked with null (øprep).
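
The counting step can be sketched on toy data as follows. The prepositions and feature sets below are invented for illustration; the real counts are built from `specdata` and the corpus in the cell that follows.

```python
import collections

# toy constructions: (preposition or None, set of tagged features)
toy_constructions = [
    ('B', {'H', 'demon'}),
    ('L', {'pl'}),
    (None, {'pl', 'quant'}),
    ('B', {'H'}),
]

counts = collections.defaultdict(collections.Counter)
for prep, feats in toy_constructions:
    # Counter.update on an iterable adds 1 for each element
    counts[prep or 'ø'].update(feats)

for prep, feat_counts in counts.items():
    print(prep, dict(feat_counts))
```

Each preposition becomes a column and each feature a row once the nested counts are handed to `pd.DataFrame`.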

In [None]:
specdata.shape

In [None]:
specdata.head()

In [None]:
# build co-occurrence data

specprep_counts = collections.defaultdict(lambda:collections.Counter())

for cx in specdata.columns:
    
    phrase = L.d(cx, 'phrase')[0]
    # get features but filter out PPtime and time since that's accounted for below
    features = dict((spec, count) for spec, count in specdata[cx].to_dict().items()
                       if spec not in {'PPtime', 'time'})
    
    # count tag and feature co-occurrences
    if F.typ.v(phrase) == 'PP':
        prep = E.head.t(phrase)[0]
        specprep_counts[F.lex_utf8.v(prep)].update(features)
    else:
        specprep_counts['ø'].update(features)
        
specprep_counts = pd.DataFrame(specprep_counts)

print(specprep_counts.shape)

specprep_counts.head()

In [None]:
specprep_counts.columns # target prepositions

In [None]:
specprep_counts.index # co-occurring features

Next, convert co-occurrence counts to Fisher's exact associations.

In [None]:
specprep_assocs = apply_fishers(specprep_counts)

specprep_assocs.head()

Show associations. The scores are log10-transformed p-values, so **any value greater than 1.3 is statistically associated** (p < 0.05, since -log10(0.05) ≈ 1.3). **Any value less than -1.3 is significantly repelled.**
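
To show where the 1.3 threshold comes from, here is a self-contained sketch of a signed association score for a single 2x2 table. It is not the `apply_fishers` implementation: I am assuming that the score is -log10 of a two-sided Fisher's exact p-value, made negative when the pair co-occurs less often than expected. The function names and the toy counts are invented.

```python
import math

def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # hypergeometric probability of seeing x in the top-left cell
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # sum every table that is no more probable than the observed one
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

def signed_score(a, b, c, d):
    """-log10(p), negated when the observed count falls below the expected count."""
    expected = (a + b) * (a + c) / (a + b + c + d)
    strength = -math.log10(fisher_p(a, b, c, d))
    return strength if a >= expected else -strength

threshold = -math.log10(0.05)          # the 1.3 cutoff
attracted = signed_score(8, 2, 1, 9)   # pair seen far more often than expected
repelled = signed_score(1, 9, 8, 2)    # pair seen far less often than expected
print(round(threshold, 3), attracted > threshold, repelled < -threshold)
```

The sign convention is the only part that goes beyond the standard test, so it is the first thing to check against `tools.significance` if the numbers disagree.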

In [None]:
for prep in specprep_assocs:
    assocs = specprep_assocs[prep].sort_values(ascending=False)
    print(prep)
    print(assocs)
    print('\n', '-'*20, '\n')

### Heatmap Visualization

In [None]:
# make copy of data to add reversed Hebrew script
heatmap_specprep_assocs = specprep_assocs.copy()
heatmap_specprep_assocs.columns = [reverse_hb(spec) for spec in specprep_assocs.columns]
plt.figure(figsize=(15, 8))
sns.heatmap(heatmap_specprep_assocs, center=0, robust=True)
plt.xticks(size=25)
plt.yticks(size=18)
plt.savefig(firstyear+'spec_attractions.png', dpi=300, bbox_inches='tight')
plt.title('Co-Spec Attractions (Fisher Exact)')
plt.show()

### Clustering Prepositions with PCA

Plotting these prepositions with PCA can give a sense of how similar/dissimilar these prepositions are to one another, as well as inform us which factors most strongly influence their separation. We do that below.

In [None]:
pca = PCA(10)
prep_fit = pca.fit(specprep_assocs.T.values)
pca_preps = prep_fit.transform(specprep_assocs.T.values)

preploadings = prep_fit.components_.T * np.sqrt(prep_fit.explained_variance_)
preploadings = pd.DataFrame(preploadings.T, index=np.arange(10)+1, columns=specprep_assocs.index)

plt.figure(figsize=(8, 6))
sns.barplot(x=np.arange(10)+1, y=prep_fit.explained_variance_ratio_[:10], color='darkblue')
plt.xlabel('Principal Component', size=16)
plt.ylabel('Ratio of Explained Variance', size=16)
plt.title('Ratio of Explained Variance for Principal Components 1-10 (Scree Plot)', size=16)
plt.show()

In [None]:
plot_PCA(specprep_assocs, components=(pca_preps[:,0], pca_preps[:,1]), annoTags=[reverse_hb(token) for token in specprep_assocs.columns])

#### Visualize the top influencing features

In [None]:
# filter x & y
x_filt = pd.DataFrame(pca_preps[:,0], index=specprep_assocs.columns)
y_filt = pd.DataFrame(pca_preps[:,1], index=specprep_assocs.columns)
x_filt = x_filt[specprep_counts.sum() > 100]
y_filt = y_filt[specprep_counts.sum() > 100]

# make simple x,y
x, y = x_filt.values, y_filt.values

influences = list(preploadings[:2].min().sort_values().head(4).index) + list(preploadings[:2].max().sort_values(ascending=False).head(4).index)

# plot coordinates
plt.figure(figsize=(12, 10))
plt.scatter(x, y, color='black')
plt.xlabel('PC1', size=18)
plt.ylabel('PC2', size=18)
plt.axhline(color='red', linestyle=':')
plt.axvline(color='red', linestyle=':')

# annotate prepositions 
prep_xy = {} # for noun_dict
annoTags = x_filt.index
for i, prep in enumerate(annoTags):
    prep_x, prep_y = x[i], y[i]
    prep_xy[annoTags[i]] = (prep_x, prep_y)
    plt.annotate(reverse_hb(prep), xy=(prep_x, prep_y), size=26, fontname='Times New Roman')

# annotate loadings 
for feat in preploadings:
    if feat not in influences: # skip under-influencers
        continue
    x, y = preploadings[feat][:2]
    plt.arrow(0, 0, x, y, color='green')
    plt.annotate(feat, xy=(x*1.15, y*1.15), color='green', size=18)
    
plt.title('Prepositions (black) and the features (green) which influence their placements on PC1 and PC2; \n(features found on time noun they govern)', size=18, pad=20)
plt.show()

#### ל and בקר

ל seems to be associated with plurality. But it also occurs with terms like בקר "morning," which is a term that occurs 90+ times with ל's opposite: ב. Interestingly, the query below shows that 3 of 10 uses with בקר actually have "morning" in the plural! Could the singular uses represent a construal of a punctiliar time as a duration?

In [None]:
# A.show(A.search('''

# construction
#     phrase
#         =: word lex=L
#         <: word lex=H
#         <: word lex=BQR=/
# '''))

### Measuring Associations Between Specifiers

In [None]:
specdata.head()

In [None]:
specicollocations = collections.defaultdict(lambda: collections.Counter())

for cx in specdata.columns:
    
    pos_values = specdata[cx][specdata[cx] > 0]
    
    for speci in pos_values.index:
        for specj in pos_values.index:
            if speci == specj:
                continue
            else:
                specicollocations[speci][specj] += 1
                
specicollocations = pd.DataFrame(specicollocations).fillna(0)

In [None]:
specicollocations = specicollocations.reindex(sorted(specicollocations.index), axis=1) # order the columns by the sorted row labels

In [None]:
specicollocations.head()

In [None]:
specicollocations_assoc = apply_fishers(specicollocations)

In [None]:
# change identical pairwise comparison scores to 0
for speci in specicollocations_assoc.columns:
    for specj in specicollocations_assoc.index:
        if speci == specj:
            # .loc[row, column] avoids chained assignment, which can silently fail
            specicollocations_assoc.loc[specj, speci] = 0

In [None]:
specicollocations_assoc = specicollocations_assoc.reindex(np.abs(specicollocations_assoc).mean().sort_values().index, axis=1) # reindex based on mean of absolute value on axis 1

In [None]:
plt.figure(figsize=(15, 8))
plt.title('Co-Spec Attractions (Fisher Exact)')
sns.heatmap(specicollocations_assoc, center=1.3)
plt.show()

In [None]:
specicollocations_assoc['pl'].sort_values(ascending=False)

## Measuring In-Clause Constituent Order

Position from verb is represented as v+1 or v-1 etc.

**NB: Account for WJHJ...**

Count total number of time constructions in verbal clauses.
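
The position labels can be sketched as two small helpers. This mirrors the logic of the cell below, where an offset is the difference between the time phrase and the predicate phrase and `'-c'` marks a time phrase sitting in a preceding ויהי clause. The helper names are my own.

```python
def order_tag(order):
    """Format a position as text: '+2' or '-1' for integer offsets, '-c' left as is."""
    sign = '+' if isinstance(order, int) and order > 0 else ''
    return f'{sign}{order}'

def position(order):
    """Collapse an offset into the two-way fronted/postverbal distinction."""
    return 'fronted' if order == '-c' or order < 0 else 'postverbal'

examples = (2, -1, '-c')
tags = [order_tag(o) for o in examples]
positions = [position(o) for o in examples]
print(list(zip(tags, positions)))
```

Checking for `'-c'` before comparing with 0 matters, since a string cannot be compared with an integer.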

In [None]:
clause_kinds_raw = collections.Counter()

for cx in F.otype.s('construction'):
    clause = L.u(cx, 'clause')[0]
    clause_kinds_raw[F.kind.v(clause)] += 1
    
convert2pandas(clause_kinds_raw)

In [None]:
clause_kinds = collections.Counter()
timeorders = collections.Counter()
pos2res = collections.defaultdict(list)
order2res = collections.defaultdict(list)
order2tense2res = collections.defaultdict(lambda: collections.defaultdict(list))
orderbytense = collections.defaultdict(lambda: collections.Counter())
posbytense = collections.defaultdict(lambda: collections.Counter())
wayehi_cases = []


for cx in testset:
    clause = L.u(cx, 'clause')[0]
    cl_kind = F.kind.v(clause)
    clause_kinds[cl_kind] += 1
    
    if cl_kind != 'VC':
        continue
        
    time = next(ph for ph in L.d(clause, 'phrase') if F.function.v(ph) == 'Time')

    # get the clause's primary predicate
    if F.typ.v(clause) in {'Ptcp'}:
        pred = next(ph for ph in L.d(clause, 'phrase') if F.function.v(ph) in {'PtcO', 'PreC'})
    else:
        pred = next(ph for ph in L.d(clause, 'phrase') if F.function.v(ph) in {'Pred', 'PreS', 'PreO'})

    # check for ויהי
    # get next clause if so
    order = None
    verb = next(w for w in L.d(pred, 'word') if F.pdp.v(w) == 'verb')
    vt, lex, ps, gn, nu = filter_tense(verb), F.lex.v(verb), F.ps.v(verb), F.gn.v(verb), F.nu.v(verb)
    if all([vt in {'wayq', 'perf', 'weqt'}, lex == 'HJH[', ps == 'p3', gn == 'm', nu == 'sg']):
        try:
            clause_atom = L.d(clause, 'clause_atom')[0]
            next_clause = E.mother.t(clause_atom)[0]
            wayehi_cases.append([clause, L.d(cx, 'phrase')[0]])
            pred = next(ph for ph in L.d(clause, 'phrase') if F.function.v(ph) in {'Pred', 'PreS', 'PreO'})
            verb = next(w for w in L.d(pred, 'word') if F.pdp.v(w) == 'verb')        
            order = '-c'
        except Exception: # lookup failed; fall back to the plain offset below
            pass
        
    tense = filter_tense(verb)
    order = order or time-pred
    sign = '+' if isinstance(order, int) and order > 0 else ''
    order_txt = f'{sign}{order}'
    timeorders[order_txt] += 1
    order2res[order_txt].append((cx, clause, time, pred))
    timepos = 'fronted' if order == '-c' or order < 0 else 'postverbal'

    pos2res[timepos].append((cx, clause, time, pred))
    orderbytense[order_txt][vt] += 1
    posbytense[timepos][vt] += 1
    order2tense2res[order_txt][vt].append((cx, clause, time, pred))

clause_kinds = convert2pandas(clause_kinds)
timeorders = convert2pandas(timeorders)
orderbytense = pd.DataFrame(orderbytense).fillna(0)
posbytense = pd.DataFrame(posbytense).fillna(0)

clause_kinds

In [None]:
len(wayehi_cases)

In [None]:
timeorders.to_excel(firstyear+'time_orders.xlsx')
display(timeorders.head(10))
countBarplot(timeorders, size=(10, 6), xlabel='Number of Constituents Away from Verb', save=firstyear+'timeposition.png')

Let's make a simpler distinction: pre-verbal or post-verbal...

In [None]:
pos_count = dict((pos, len(pos2res[pos])) for pos in pos2res)
pos_count = convert2pandas(pos_count)
pos_count['%'] = (pos_count['Total'] / pos_count['Total'].sum()).round(2) * 100
pos_count.to_excel(firstyear+'pos_count.xlsx')
pos_count

In [None]:
timeorders.loc[['+2', '-1', '+1', '-2']].sum() / pos_count['Total'].sum()

In [None]:
#A.show(order2res['+2'], end=20)

### Looking at Order/Tense Associations

In [None]:
posbytense

In [None]:
pbt_assoc = apply_fishers(posbytense)

In [None]:
show_pbt_assoc = pbt_assoc.sort_values(by='fronted')

plt.figure(figsize=(4, 8))
sns.heatmap(show_pbt_assoc, center=0)
plt.yticks(size=20, rotation='horizontal')
plt.xticks(size=20)
plt.savefig(firstyear+'heatmap_timePOS.png', dpi=300, bbox_inches='tight')
plt.title('Time Adverbial Positions and Tenses (Fisher Exact)')
plt.show()

In [None]:
pcounts = posbytense[['postverbal', 'fronted']].sort_values(by='postverbal', ascending=False)
pcounts.to_excel(firstyear+'time_position.xlsx')
pcounts

## Verb Collocations 

### Method

This part of the analysis will seek to examine verb collocations against 3 reference points amongst time adverbials: direction, quantity, distance:

* direction - a preposition lexeme
* quantity - singular or plural (derived from pl, du, card, quant, qual)
* distance
    * near - e.g. ה, זה
    * far - e.g. היא, הוא
    
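The end result will be a 3 part tag, with 9 possible combinations, e.g. **B.sg.near, L.pl.far**. Note that the code below also inserts a head slot after the direction (`time`, or the lexeme itself for bare adverbs), so the stored tags look like `ø.time.pl.ø`.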

In order to build this data, we have to use a modified tagger function.
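
Before the full tagger, here is a simplified sketch of how the three reference points combine into a tag. It takes plain arguments instead of corpus nodes and leaves out the head slot that the full tagger adds; the function name, its parameters, and the shortened demonstrative map are assumptions for illustration.

```python
# shortened demonstrative map (transliterated lexemes), for illustration only
DEMON_DISTANCE = {'ZH': 'near', 'Z>T': 'near', 'HW>': 'far', 'HJ>': 'far'}

def dqd_tag(prep=None, plural=False, demonstrative=None, bare_article=False):
    """Build a direction.quantity.distance tag from already-extracted values."""
    direction = prep or 'ø'
    quantity = 'pl' if plural else 'sg'
    if demonstrative:
        distance = DEMON_DISTANCE[demonstrative]
    elif bare_article:
        # a standalone ה is read as a near demonstrative
        distance = 'near'
    else:
        distance = 'ø'
    return f'{direction}.{quantity}.{distance}'

tag_a = dqd_tag(prep='B', demonstrative='HW>')
tag_b = dqd_tag(prep='L', plural=True)
tag_c = dqd_tag(bare_article=True)
print(tag_a, tag_b, tag_c)
```

Keeping the three decisions separate makes it easy to see which feature is responsible when a tag looks wrong.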

In [None]:
demon_maps = {'Z>T': 'near',
              'HJ>': 'far',
              'HMH': 'far',
              '>LH': 'near',
              'HM': 'far',
              'HW>': 'far',
              'ZH': 'near'}

cx2dqd = collections.defaultdict(set)
dqd2res = collections.defaultdict(list)

for cx in specdata.columns:
    features = specdata[cx]
    phrase = L.d(cx, 'phrase')[0]
    
    # -- TAG DIRECTION --
    if F.typ.v(phrase) == 'PP':
        prep = next(w for w in L.d(phrase, 'word') if F.pdp.v(w)=='prep')
        direct = F.lex.v(prep)
    else:
        direct = 'ø'
        
    # -- TAG DISTANCE --
    # for standalone H demonstrative tests
    standalone = not any([features['demon'], features['attr_patt'],  # ensure no other modifiers
                          features['quant'], features['PPtime']])
    # check demonstratives
    if features['demon']:
        demon = next(w for w in L.d(cx, 'word') if F.pdp.v(w) in {'prde'})
        
        dist = demon_maps[F.lex.v(demon)]
    # check for demonstrative H
    elif features['H'] and standalone:
        dist = 'near'
        
    else:
        dist = 'ø'
        
    # -- TAG QUANTITY -- 
    plurals = any([features['quant'], features['pl'], features['du'], features['qual']])
    if plurals:
        quant = 'pl'
    else:
        quant = 'sg'
        
    # configure time for adverbs
    is_advb = not set(features[features>0].index) - {'time', 'PPtime'} # make sure nothing else is present
    head = 'time' if not is_advb else F.lex.v(E.nhead.t(phrase)[0])
        
    # Direction, quantity, distance tag
    dqd = f'{direct}.{head}.{quant}.{dist}'
    cx2dqd[cx] = dqd
    dqd2res[dqd].append((cx, phrase))

In [None]:
dqd2res.keys()

In [None]:
dqdbyevent = collections.defaultdict(lambda: collections.Counter())
dqdbytense = collections.defaultdict(lambda: collections.Counter())
dqd2event2res = collections.defaultdict(lambda: collections.defaultdict(list))
event2res = collections.defaultdict(list)
dqd2tense2res = collections.defaultdict(lambda: collections.defaultdict(list))
wayehi_cases = []
wayehi_exceptions = []

for cx in testset:
        tag = cx2dqd[cx]
        clause = L.u(cx, 'clause')[0]
        
        if F.kind.v(clause) != 'VC': # skip non-verbal clauses
            continue
            
        # get the clause's primary predicate
        if F.typ.v(clause) in {'Ptcp'}:
            pred = next(ph for ph in L.d(clause, 'phrase') if F.function.v(ph) in {'PtcO', 'PreC'})
        else:
            pred = next(ph for ph in L.d(clause, 'phrase') if F.function.v(ph) in {'Pred', 'PreS', 'PreO'})
            
        # check for ויהי
        # get next clause if so
        verb = next(w for w in L.d(pred, 'word') if F.pdp.v(w) == 'verb')
        vt, lex, ps, gn, nu = F.vt.v(verb), F.lex.v(verb), F.ps.v(verb), F.gn.v(verb), F.nu.v(verb)
        if all([vt in {'wayq', 'perf'}, lex == 'HJH[', ps == 'p3', gn == 'm', nu == 'sg']):
            next_clause = None # stays None if the lookup below fails
            try:
                clause_atom = L.d(clause, 'clause_atom')[0]
                next_clause = E.mother.t(clause_atom)[0]
                wayehi_cases.append([clause, L.d(cx, 'phrase')[0], next_clause])
                clause = L.u(next_clause, 'clause')[0]
                pred = next(ph for ph in L.d(clause, 'phrase') if F.function.v(ph) in {'Pred', 'PreS', 'PreO'})
                verb = next(w for w in L.d(pred, 'word') if F.pdp.v(w) == 'verb')
            except Exception:
                wayehi_exceptions.append([L.d(cx, 'phrase')[0], next_clause])
                continue # skip them
            
            
        # check for obj/cmpl arguments
        obj_cmpl = set(ph for ph in L.d(clause,'phrase') if F.function.v(ph) in {'Objc', 'Cmpl'})
        sffx_obj = F.function.v(pred) in {'PtcO', 'PreO'}
        oc_check = '+obj/cmp' if any([obj_cmpl, sffx_obj]) else ''
        
        # tokenize the predicate
        vs, lex, vt = F.vs.v(verb), F.lex.v(verb), filter_tense(verb)
        verb_token = f'{lex}.{vs}'
        
        # count co-occurrence
        result = (cx, clause, L.d(cx, 'phrase')[0],  verb)
        dqdbyevent[tag][verb_token] += 1
        dqdbytense[tag][vt] += 1
        dqd2event2res[tag][verb_token].append(result)
        dqd2tense2res[tag][vt].append(result)
        event2res[verb_token].append(result)
        
dqdbyevent = pd.DataFrame(dqdbyevent).fillna(0)
dqdbytense = pd.DataFrame(dqdbytense).fillna(0)

print(dqdbyevent.shape)
print(len(wayehi_cases), 'wayehi cases handled...')
print(len(wayehi_exceptions), 'wayehi exceptions ignored...')

dqdbyevent.head()

In [None]:
dqdbyevent.shape

### Look at Raw Associations

In [None]:
dqdbyevent2 = dqdbyevent.drop('VM>[.qal') # remove outlier

In [None]:
dbe_assoc = apply_fishers(dqdbyevent2)
dbe_assoc.head()

In [None]:
dbe_assoc.max().sort_values(ascending=False).head(10)

In [None]:
dbe_assoc.min().sort_values().head(10)

In [None]:
top_strongest = dbe_assoc.max().sort_values(ascending=False).head(10).index
compare = dbe_assoc[top_strongest].sort_values(by='ø.time.pl.ø', ascending=False)

#compare = compare.reindex(compare.T.quantile(0.25).sort_values().index).head(20) # get the most polarizing adverbials

plt.figure(figsize=(10, 8))
plt.title('Time Adverbial and Verb Attractions (Fisher Exact)')
sns.heatmap(compare, center=0)
plt.yticks(size=10)
plt.xticks(size=20, rotation='vertical')
plt.show()

### PCA Tests

In [None]:
pca = PCA(10)
dqd_fit = pca.fit(dbe_assoc.T.values)
pca_dqd = dqd_fit.transform(dbe_assoc.T.values)

dqdloadings = dqd_fit.components_.T * np.sqrt(dqd_fit.explained_variance_)
dqdloadings = pd.DataFrame(dqdloadings.T, index=np.arange(10)+1, columns=dbe_assoc.index)

plt.figure(figsize=(8, 6))
sns.barplot(x=np.arange(10)+1, y=dqd_fit.explained_variance_ratio_[:10], color='darkblue')
plt.xlabel('Principal Component', size=16)
plt.ylabel('Ratio of Explained Variance', size=16)
plt.title('Ratio of Explained Variance for Principal Components 1-10 (Scree Plot)', size=16)
plt.show()

In [None]:
def plot_PCA(pca_nouns, 
             zoom=tuple(), 
             noun_xy_dict=False, 
             save='', 
             annotate=True, 
             title='', 
             components=tuple(),
             annoTags=(),
             anno_size=18
            ):
    '''
    Plots a PCA noun space.
    Function is useful for presenting various zooms on the data.
    '''
    
    x, y = components
    
    # plot coordinates
    plt.figure(figsize=(12, 10))
    plt.scatter(x, y, s=50)

    if zoom:
        xmin, xmax, ymin, ymax = zoom
        plt.xlim(xmin, xmax)
        plt.ylim(ymin, ymax)
    
    if title:
        plt.title(title, size=18)
    plt.xlabel('PC1', size=18)
    plt.ylabel('PC2', size=18)
    plt.axhline(color='red', linestyle=':')
    plt.axvline(color='red', linestyle=':')
    
    # annotate points
    if annotate:
        noun_xy = {} # for noun_dict
        noun_lexs = annoTags
        
        for i, noun in enumerate(noun_lexs):
            noun_x, noun_y = x[i], y[i]
            noun_xy[annoTags[i]] = (noun_x, noun_y)
            if zoom: # to avoid annotating outside of field of view (makes plot small)
                if any([noun_x < xmin, noun_x > xmax, noun_y < ymin, noun_y > ymax]):                
                    continue # skip noun
            plt.annotate(noun, xy=(noun_x, noun_y), size=anno_size)
    
    if save:
        plt.savefig(save, dpi=300, bbox_inches='tight')
    
    
    plt.show()
    
    if noun_xy_dict:
        return noun_xy

In [None]:
plot_PCA(dbe_assoc, components=(pca_dqd[:,0], pca_dqd[:,1]), annoTags=dbe_assoc.columns, anno_size=10)

In [None]:
plot_PCA(dbe_assoc, zoom=(-5, 5, -5, 5), components=(pca_dqd[:,0], pca_dqd[:,1]), annoTags=dbe_assoc.columns)

In [None]:
plot_PCA(dbe_assoc, zoom=(-0.6, 0.5, -1, 0.5), components=(pca_dqd[:,0], pca_dqd[:,1]), annoTags=dbe_assoc.columns)

In [None]:
# influences = list(dqdloadings[:2].min().sort_values().head(5).index) + list(dqdloadings[:2].max().sort_values(ascending=False).head(5).index)

# x, y = (pca_dqd[:,0], pca_dqd[:,1])

# # plot coordinates
# plt.figure(figsize=(12, 10))
# plt.scatter(x, y, color='black')
# plt.xlabel('PC1', size=18)
# plt.ylabel('PC2', size=18)
# plt.axhline(color='red', linestyle=':')
# plt.axvline(color='red', linestyle=':')

# zoom = (-10, 10, -7, 7)
# plt.xlim(zoom[0], zoom[1])
# plt.ylim(zoom[2], zoom[3])


# #plt.savefig('plots/duration/conj_PCA_biplot.png', dpi=300)
    
# plt.show()

In [None]:
def show_dqd():
    x, y = pd.DataFrame(pca_dqd[:,0], index=dqdbyevent.columns), pd.DataFrame(pca_dqd[:,1], index=dqdbyevent.columns)
    xy = pd.concat([x, y], axis=1) # axis must be passed by keyword in recent pandas
    xy.columns = ['x', 'y']

    axy = xy[xy.index.str.contains('pl') & xy.index.str.startswith('ø')] # red, ø+pl
    bxy = xy[xy.index.str.startswith('ø') & xy.index.str.contains('sg')] # +ø+sg
    cxy = xy.loc[[i for i in xy.index if i not in set(axy.index)|set(bxy.index)]] # +prep

    # plot coordinates
    plt.figure(figsize=(15, 6))
    
    ax1 = plt.scatter(axy['x'], axy['y'], s=dqdbyevent.sum()[axy.index], color='red')
    ax2 = plt.scatter(cxy['x'], cxy['y'], s=dqdbyevent.sum()[cxy.index], color='blue')
    ax3 = plt.scatter(bxy['x'], bxy['y'], s=dqdbyevent.sum()[bxy.index], color='black', alpha=0.5)
    
    plt.legend(['øprep + pl', 'prep + sg/pl', 'øprep + sg'], loc='upper right', fontsize=18)

    plt.axhline(color='black', linewidth=0.6)
    plt.axvline(color='black', linewidth=0.6)
    
    plt.axis('scaled')
    
    zoom=False
    if zoom:
        xmin, xmax, ymin, ymax = zoom
        plt.xlim(xmin, xmax)
        plt.ylim(ymin, ymax)

    title = ''
    if title:
        plt.title(title, size=18)
    plt.xlabel('PC1', size=18)
    plt.ylabel('PC2', size=18)
    
# for lex in dqdloadings:
    
#     if lex not in influences:
#         continue
    
#     x, y = dqdloadings[lex][:2]
#     plt.arrow(0, 0, x, y, color='green')
    
#     # handle zooms
#     if any([x < zoom[0], x > zoom[1], y < zoom[2], y > zoom[3]]):                
#         continue
        
#     plt.annotate(lex, xy=(x*1.15, y*1.15), color = 'green', size=10)
    
    plt.savefig(firstyear+'aspect_pca.png', dpi=300, bbox_inches='tight')
    
    plt.show()
    
show_dqd()

In [None]:
pd.DataFrame(dqdloadings.loc[1].sort_values(ascending=False).head(30)).to_excel(firstyear+'durative_loadings.xlsx')
dqdloadings.loc[1].sort_values(ascending=False).head(30)

In [None]:
dqdloadings.loc[1].sort_values().head(30)

### Surprising Cases

#### The Durative of Intent 

In [None]:
# A.show(event2res['SGR[.hif'])

In [None]:
T.sectionFromNode(686186)

In [None]:
T.text(L.u(686186, 'verse')[0])

In [None]:
# A.show(dqd2event2res['ø.time.pl.ø']['>KL[.qal'])

In [None]:
T.sectionFromNode(652380)

In [None]:
T.text(L.u(652380, 'verse')[0])

### Identify Statistically Insignificant Cases

In [None]:
# surprises = collections.defaultdict(lambda:collections.defaultdict(list))

# for clust, events in clust2event2res.items():
#     for event in events:
        
#         results = clust2event2res[clust][event]
#         # check association score
#         assoc = cbe_assoc[clust][event]
#         if assoc < 0:
#             surprises[clust][event].extend(results)
            
# len(surprises)

In [None]:
# for key in surprises:
#     print(key, '\t', len(surprises[key]))

In [None]:
# for clust, events in surprises.items():
#     for event in events:
#         print(f'cluster: {clust}')
#         print(f'event: {event}')
#         print(f'assoc: {cbe_assoc[clust][event]}')
#         A.show(surprises[clust][event])
#         print('-'*20)

In [None]:
#dbe_assoc.loc['BW>[.qal+obj/cmp'].sort_values(ascending=False).head(10)

In [None]:
# dqdbyevent.loc['בוא.qal+obj/cmp'].sort_values(ascending=False)

In [None]:
# A.show(clust2event2res['time.pl.quant.card']['בוא.qal+obj/cmp'])

In [None]:
# cbe_assoc.loc['אמר.qal'].sort_values(ascending=False)

In [None]:
# A.show(clust2event2res['PPtime.pl']['אמר.qal'])

In [None]:
# A.show(clust2event2res['PPtime.H.pl.attr_patt.demon']['אמר.qal'])

### Durative & Verb Preferences

In [None]:
# cbe_assoc[duratives].max(1).sort_values(ascending=False)

In [None]:
# clusterbyevent[duratives].sum().sum()

### Tense Tests

In [None]:
dqdbytense

### Examples for Paper

In [None]:
formatPassages(dqd2tense2res['ø.time.sg.near']['ptca'])

In [None]:
cbt_assoc = apply_fishers(dqdbytense)
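The internals of `apply_fishers` live in `tools.significance` and are not shown in this notebook. As a rough illustration of the kind of score such a function could produce, here is a minimal sketch that assumes a two-sided Fisher exact test on a 2x2 contingency table, reported as a signed log10 p-value (positive for attraction, negative for repulsion). The helper names `fisher_two_sided` and `signed_log10` are hypothetical and are not part of the project's tools.

```python
import math

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # hypergeometric probability of x in the top-left cell, margins fixed
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # sum every possible table that is at least as unlikely as the observed one
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= observed * (1 + 1e-9))
    return min(p, 1.0)

def signed_log10(a, b, c, d):
    """Hypothetical association score: -log10(p), negated when a is below expectation."""
    p = fisher_two_sided(a, b, c, d)
    expected = (a + b) * (a + c) / (a + b + c + d)
    strength = -math.log10(p)
    return strength if a >= expected else -strength

# a = adverbial with tense, b = adverbial with other tenses,
# c = other adverbials with tense, d = everything else (toy counts)
score = signed_log10(3, 1, 1, 3)
```

On this scale a p-value of 0.05 corresponds to roughly 1.3, since -log10(0.05) ≈ 1.3; whether the project's own scores use exactly this scale is an assumption here.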

### Look at Raw Associations

In [None]:
cbt_assoc.max().sort_values(ascending=False).head(10)

In [None]:
cbt_assoc.min().sort_values().head(10)

### PCA Tests

In [None]:
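# fit a 9-component PCA with adverbials as samples (rows) and tenses as features (columns)
pca = PCA(9)
cbt_fit = pca.fit(cbt_assoc.T.values)
pca_cbt = cbt_fit.transform(cbt_assoc.T.values)

# loadings: principal axes scaled by the standard deviation of each component
cbtloadings = cbt_fit.components_.T * np.sqrt(cbt_fit.explained_variance_)
cbtloadings = pd.DataFrame(cbtloadings.T, index=np.arange(9)+1, columns=cbt_assoc.index)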

plt.figure(figsize=(8, 6))
sns.barplot(x=np.arange(9)+1, y=cbt_fit.explained_variance_ratio_[:9], color='darkblue')
plt.xlabel('Principal Component', size=16)
plt.ylabel('Ratio of Explained Variance', size=16)
plt.title('Ratio of Explained Variance for Principal Components 1-9 (Scree Plot)', size=16)
plt.show()
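To make the loadings computation in the cell above more concrete, the following sketch reproduces the same quantities with plain NumPy on random toy data. It assumes the standard definition of PCA via SVD of the centered data; the array `scores` is a made-up stand-in for `cbt_assoc.T.values`, not real data from this analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy stand-in: 12 adverbials (rows) x 4 tenses (columns)
scores = rng.normal(size=(12, 4))

# PCA works on column-centered data
centered = scores - scores.mean(axis=0)

# rows of vt are the principal axes (the analogue of components_)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
explained_variance = s ** 2 / (len(scores) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# loadings: each axis scaled by the standard deviation of its component
loadings = vt.T * np.sqrt(explained_variance)  # shape (features, components)

# sample coordinates in component space (the analogue of transform())
coords = centered @ vt.T                       # shape (samples, components)
```

When all components are kept, `loadings @ loadings.T` recovers the covariance matrix of the features, which is why the loadings can be read as the contribution of each tense to each component.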

In [None]:
plot_PCA(cbt_assoc,  components=(pca_cbt[:,0], pca_cbt[:,1]), annoTags=cbt_assoc.columns, anno_size=12)

In [None]:
x, y = (pca_cbt[:,0], pca_cbt[:,1])

# plot coordinates
plt.figure(figsize=(12, 10))
plt.scatter(x, y, color='black')
plt.xlabel('PC1', size=18)
plt.ylabel('PC2', size=18)
plt.axhline(color='red', linestyle=':')
plt.axvline(color='red', linestyle=':')

for verbconj in ('wayq', 'ptca', 'impf'):
    # loadings on PC1 and PC2 give the arrow tip for this verb form
    load_x, load_y = cbtloadings[verbconj].iloc[:2]
    plt.arrow(0, 0, load_x, load_y, color='green')
    plt.annotate(verbconj, xy=(load_x*1.15, load_y*1.15), color='green', size=20)
        
plt.show()

### Exploring the Contribution of Prepositions

In [None]:
# def show_cbt(zoom=None):

#     x, y = pd.DataFrame(pca_cbt[:,0], index=dqdbyevent.columns), pd.DataFrame(pca_cbt[:,1], index=dqdbytense.columns)
#     xy = pd.concat([x, y],1)
#     xy.columns = ['x', 'y']

# #     axy = xy[xy.index.str.contains('sg')]
# #     bxy = xy[xy.index.str.contains('pl')]
# #     cxy = xy.loc[[i for i in xy.index if i not in set(axy.index)|set(bxy.index)]]

#     axy = xy[xy.index.str.startswith('MN')] # blue
#     bxy = pd.concat([xy[xy.index.str.startswith('<D')], xy[xy.index.str.startswith('L')]]) # red
#     dxy = xy.loc[[i for i in xy.index if i not in set(axy.index)|set(bxy.index)]] # grey

#     # plot coordinates
#     plt.figure(figsize=(12, 10))
#     plt.scatter(axy['x'], axy['y'], s=dqdbyevent.sum()[axy.index], color='blue')
#     plt.scatter(bxy['x'], bxy['y'], s=dqdbyevent.sum()[bxy.index], color='red')
#     plt.scatter(dxy['x'], dxy['y'], s=dqdbyevent.sum()[dxy.index], color='grey', alpha=0.5)

# #     mn = 'ןמ'
# #     lad = 'ל & דע'
# #     plt.legend([mn, lad], loc='lower left', fontsize=25)
    
#     if zoom:
#         xmin, xmax, ymin, ymax = zoom
#         plt.xlim(xmin, xmax)
#         plt.ylim(ymin, ymax)

#     title = ''
#     if title:
#         plt.title(title, size=18)
#     plt.xlabel('PC1', size=18)
#     plt.ylabel('PC2', size=18)
    
#     plt.axhline(color='red', linestyle=':')
#     plt.axvline(color='red', linestyle=':')

#     annotate = False
#     # annotate points
#     if annotate:
#         noun_xy = {} # for noun_dict
#         noun_lexs = annoTags

#         for i, noun in enumerate(noun_lexs):
#             noun_x, noun_y = x[i], y[i]
#             noun_xy[annoTags[i]] = (noun_x, noun_y)
#             if zoom: # to avoid annotating outside of field of view (makes plot small)
#                 if any([noun_x < xmin, noun_x > xmax, noun_y < ymin, noun_y > ymax]):                
#                     continue # skip noun
#             plt.annotate(noun, xy=(noun_x, noun_y), size=anno_size)
            
#     for verbconj in ('ptca', 'wayq', 'impf', 'perf'):
#         x, y = cbtloadings[verbconj][:2]
#         plt.arrow(0, 0, x, y, color='green')
#         plt.annotate(verbconj, xy=(x*1.15, y*1.15), color = 'green', size=16)
            
# #    plt.title('Opposition of ל & דע over against ןמ, based on their verb collocation preferences')
            
#     plt.show()

# show_cbt()

#### Looking at Loadings

In [None]:
cbtloadings.loc[1].sort_values(ascending=False)

In [None]:
#A.show(clust2tense2res['time.H']['ptca'])

In [None]:
#A.show(clust2tense2res['time.pl.quant.card']['wayq'])

In [None]:
compare = cbt_assoc.T
compare = compare.reindex(compare.T.quantile(0.25).sort_values().index).head(20) # get the most polarizing adverbials

plt.figure(figsize=(10, 8))
sns.heatmap(compare, center=0)
plt.yticks(size=14)
plt.xticks(size=20, rotation='vertical')
# note: saved before plt.title is called, so the exported PNG carries no title
plt.savefig(firstyear+'heatmap_tenses.png', dpi=300, bbox_inches='tight')
plt.title('Time Adverbial and Tense Attractions (Fisher Exact)')
plt.show()

In [None]:
compare = cbt_assoc.loc[['wayq', 'impf']].T.sort_values(by='wayq', ascending=False).head(10)

plt.figure(figsize=(5, 6))
sns.heatmap(compare, center=0)
plt.yticks(size=15)
# note: saved before plt.title is called, so the exported PNG carries no title
plt.savefig(firstyear+'heatmap_wayq_yiqt.png', dpi=300, bbox_inches='tight')
plt.title('Time Adverbial and Tense Attractions (Fisher Exact)')
plt.show()

In [None]:
compare = cbt_assoc.loc[['wayq', 'impf']].T.sort_values(by='impf', ascending=False).head(10)

plt.figure(figsize=(5, 6))
sns.heatmap(compare, center=0)
plt.yticks(size=15)
# note: saved before plt.title is called, so the exported PNG carries no title
plt.savefig(firstyear+'heatmap_yqtl_wyqt.png', dpi=300, bbox_inches='tight')
plt.title('Time Adverbial and Tense Attractions (Fisher Exact)')
plt.show()

In [None]:
compare = cbt_assoc.loc[['wayq', 'ptca']].T.sort_values(by='ptca', ascending=False).head(10)

plt.figure(figsize=(5, 6))
plt.title('Time Adverbial and Tense Attractions (Fisher Exact)')
sns.heatmap(compare, center=1.3)
plt.yticks(size=15)
plt.show()

In [None]:
compare = cbt_assoc.loc[['wayq']].T.sort_values(by='wayq', ascending=False).head(10)

plt.figure(figsize=(4, 6))
plt.title('Time Adverbial and Tense Attractions (Fisher Exact)')
sns.heatmap(compare, center=1.3, robust=True)
plt.yticks(size=12)
plt.show()