*best viewed in [nbviewer](https://nbviewer.jupyter.org/github/CambridgeSemiticsLab/BH_time_collocations/blob/master/results/notebooks/yiqtol_nom_advb.ipynb)*

<center><h1>Yiqtol Collocations with Nominal versus Adverbial Heads</h1></center>
<center><h2 style="font-weight:normal">Cody Kingham</h2></center>
<center><h2><a href="../../docs/sponsors.md"><img height=15% width=15% src="../../docs/images/CambridgeU_BW.png"></a></h2></center>

In [1]:
! echo "last updated:"; date

## Introduction

In an earlier analysis, I found some evidence that the *yiqtol* verb tends to prefer time adverbial heads that do not nominalize with features such as plural endings or definite articles (see results [here](https://github.com/CambridgeSemiticsLab/BH_time_collocations/blob/master/archive/2019-10-31/analysis/exploratory/construction_clusters.ipynb)). In this notebook, we will ask that question more specifically, namely:

**Does yiqtol show preference, generally, for non-nominal time adverbial heads?**

By "non-nominal" I mean what are often called "particles." Common examples would be עתה, אז, עולם, etc.

In asking this question, I do not assume a classic definition of parts of speech, but rather a definition based on a construction grammar approach. The constructional approach to parts-of-speech does not assume universal categories. Categories like "adjective," "adverb," and even "noun" or "verb" are language specific. Not only that, they are not always closed, neatly-defined categories, as words can exist on a continuum.

**In asking the above question about *yiqtol*, we must simultaneously ask the same question about all other verb tenses.**

## Nominal Hypothesis: quantification as a distinguishing marker

Which words should we consider "nominal" and which "particles"?

The working hypothesis of this inquiry is that noun-like words are identified not by an assumed word class (e.g. "noun" versus "adverb"), but by the constructions within which a word regularly appears. Constructions used for indicating quantification are particularly relevant for the noun/particle distinction. These include plural noun endings and definite article constructions, but also count-noun constructions (e.g. "three years").

These markers are predicted to occur with nouns when aligned to a conceptual space, as seen in Croft (2001: 99) for English:

<img height=30% width=30% src="../../docs/images/figures/Croft_2001_POS_map.jpg">

Croft notes two axes along which language encodes certain functions. The y-axis denotes an object-to-action continuum, whereas the x-axis denotes a reference-to-predication continuum.

The words we are seeking to distinguish in this notebook are those which are constrained as objects with reference ("nouns", the upper-leftmost quadrant) versus those which are not ("particles"). Why such a clean two-way division? Because our dataset contains only time adverbials, and amongst those adverbials we primarily see nouns or particles. We do not expect verbs, for instance, since this dataset excludes infinitival adverbs.

### Statistical Association with Quantification

Quantification constructions will not co-occur with every instance of a given lexeme, for instance where the lexeme appears in the singular. Thus, we will classify lexemes by their overall statistical tendency. The question is this: is a given lexeme statistically associated with quantification? If so, we label it as nominal. If it is statistically disassociated, we label it as a particle.
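To make the idea of "statistical tendency" concrete, here is a minimal sketch of the kind of 2x2 contingency table that an association measure works on: one lexeme versus all other lexemes, with and without a nominal marker. The counts and lexeme names are hypothetical toy data, and the one-versus-rest layout is an assumption about how such tables are usually built, not a description of this project's `contingency_table` helper.

```python
from collections import Counter

# Hypothetical toy counts: lexeme -> occurrences with ('nom') and without ('ønom') a marker.
toy_counts = {
    'day': Counter({'nom': 90, 'ønom': 10}),
    'now': Counter({'nom': 2, 'ønom': 48}),
    'night': Counter({'nom': 20, 'ønom': 15}),
}

def two_by_two(counts, lexeme, tag='nom'):
    """Return the cells (a, b, c, d) of a one-versus-rest table.

    a: lexeme with tag          b: lexeme without tag
    c: other lexemes with tag   d: other lexemes without tag
    """
    a = counts[lexeme][tag]
    b = sum(counts[lexeme].values()) - a
    c = sum(cnt[tag] for lex, cnt in counts.items() if lex != lexeme)
    total = sum(sum(cnt.values()) for cnt in counts.values())
    d = total - a - b - c
    return a, b, c, d

a, b, c, d = two_by_two(toy_counts, 'now')
# Count we would expect in cell a if lexeme and marker were independent.
expected_a = (a + b) * (a + c) / (a + b + c + d)
print((a, b, c, d), round(expected_a, 2))
```

In this toy table the observed count for "now" with a marker (2) falls far below the expected count, which is the pattern we would call disassociation.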

## Nominal versus Particle

Once we have classified the lexemes based on their statistical tendencies, we can use those classes for collocation measurements with yiqtol and the other tenses.

<hr>

<center><h2>Python</h2></center>

## Import Modules and Data

In [2]:
# standard packages
from pathlib import Path
import collections
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# custom package in /tools
from paths import figs
from helpers import Figures, convert2pandas
from tf_tools.load import load_tf
from tf_tools.tokenizers import tokenize_surface
from cx_analysis.load import cxs
from cx_analysis.search import SearchCX
from stats.significance import contingency_table, apply_fishers

TF, API, A = load_tf(silent='deep')
A.displaySetup(condenseType='phrase')
F, E, T, L = A.api.F, A.api.E, A.api.T, A.api.L
se = SearchCX(A)
phrase2cxs = cxs['phrase2cxs']
class2cx = cxs['class2cx']
time_cxs = list(phrase2cxs.values())
sns.set(font_scale=1.5, style='whitegrid')

## Measure Quantification Associations 

We now measure the quantification tendencies of time adverbial heads and build the noun/particle categories.

For each lexeme, we iterate through all of its occurrences in the Hebrew Bible. We then make a count for each occurrence on whether it appears with any of the following constructions:

* plural noun ending
* quantification with a cardinal number or qualitative quantifier (e.g. כל)
* modification with a definite article
* modification with a demonstrative pronoun 

We will *exclude* the following contexts:

* any phrase with a verbal (infinitive) head
* any phrase with a prepositional head\*
* non-verbal clauses
* multi-phrasal time adverbials
* any clause with a non-finite main verb 

The selection of single time adverbials is made below.

\*Prepositional heads deploy plural markers but do not function like normal nominals.

In [58]:
single_tas = class2cx['single'] - class2cx['component']

print(f'Number of single time adverbials: {len(single_tas)}')

Number of single time adverbials: 3823


The remaining time adverbials are now further filtered based on the exclusions noted above. They will be stored based on the lexeme string of their heads.

As we iterate through the adverbials, we will also take the opportunity to observe the type of verb the adverbial collocates with in that particular context. For each head lexeme, we count the tense of the collocated verb.

In [79]:
# map to head lexemes here
head2count = collections.defaultdict(lambda:collections.Counter()) # count: nom, ønom
contexts = collections.defaultdict(lambda:collections.defaultdict(list))
head2verbcount = collections.defaultdict(lambda:collections.Counter()) 

tense_map = {'ptca': 'ptcp'} # shorten "active participle" string

# apply filters to the time adverbials (tas)
for ta in single_tas:
    
    head = list(ta.getsuccroles('head'))[-1] # head selected from graph
    head_cx = next(iter(ta.graph.pred[head]))
    
    # apply exclusions / filter
    clause = L.u(head, 'clause')[0]
    verb = next((w for w in L.d(clause,'word') if F.pdp.v(w) == 'verb'), None)
    tense = tense_map.get(F.vt.v(verb), F.vt.v(verb))
    exclusions = [
        not verb,
        F.sp.v(head) == 'verb',
        head_cx.name == 'prep',
        tense not in {
            'yqṭl', 'qṭl', 
            'wyqṭl', 'wqṭl',
            'impv', 'ptcp',
        },
    ]
    if any(exclusions):
        continue
    
    # make the collocation counts
    nominal_markers = (
        F.nu.v(head) in {'du', 'pl'},
        'quantified' in ta.classification,
        'definite' in ta.classification,
        'demonstrative' in ta.classification,
    )
    tag = 'nom' if any(nominal_markers) else 'ønom'
    head_lex = F.lex_sbl.v(head)
    head2count[head_lex][tag] += 1
    head2verbcount[head_lex][tense] += 1
    contexts[head_lex][tag].append(ta)
    contexts[head_lex][tense].append(ta)
    
head_collocates = pd.DataFrame.from_dict(head2count, orient='index').fillna(0)
verb_collocates = pd.DataFrame.from_dict(head2verbcount, orient='index').fillna(0)
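The counting pattern in the cell above can be seen in isolation in the following sketch. It uses only the standard library and hypothetical observations, and it builds plain lists rather than a DataFrame; the point is only to show how nested counters become a dense lexeme-by-tag table in which missing combinations are zero, which is what `fillna(0)` achieves in the cell above.

```python
import collections

# Hypothetical (lexeme, tag) observations standing in for the filtered adverbials.
observations = [
    ('day', 'nom'), ('day', 'nom'), ('day', 'ønom'),
    ('now', 'ønom'),
]

# Outer key: head lexeme; inner Counter: how often each tag was seen.
head2count = collections.defaultdict(collections.Counter)
for lexeme, tag in observations:
    head2count[lexeme][tag] += 1

# Fix a column order, then read each Counter densely.
# A Counter returns 0 for unseen keys, so no explicit fill step is needed.
columns = sorted({tag for counts in head2count.values() for tag in counts})
table = {
    lexeme: [float(counts[tag]) for tag in columns]
    for lexeme, counts in head2count.items()
}

print(columns)
for lexeme, row in table.items():
    print(lexeme, row)
```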

We now have two principal datasets, stored in `head_collocates` and `verb_collocates`. Let's have a look at their contents, beginning with the shape of each dataset.

In [80]:
print(head_collocates.shape)

(97, 2)


In [81]:
print(verb_collocates.shape)

(97, 6)


The top of each dataset is seen below.

In [82]:
head_collocates.head(5)

Unnamed: 0,nom,ønom
ywm-N,971.0,159.0
ʿrb-N,71.0,1.0
mṣʿr-N,1.0,0.0
lylh-N,74.0,34.0
ʿt-N,78.0,46.0


In [83]:
verb_collocates.head(5)

Unnamed: 0,qṭl,yqṭl,ptcp,impv,wyqṭl,wqṭl
ywm-N,331.0,329.0,89.0,28.0,257.0,96.0
ʿwd-N,42.0,198.0,24.0,5.0,48.0,8.0
ʿrb-N,8.0,29.0,2.0,0.0,13.0,20.0
mhrh-N,2.0,3.0,0.0,7.0,2.0,2.0
mṣʿr-N,1.0,0.0,0.0,0.0,0.0,0.0


The `contexts` dictionary provides access to the particular instances seen in the data. For example, if we want to access the one case of `ʿrb` used without explicit nominalizers, we can select it and display it like this:

In [85]:
for cx in contexts['ʿrb-N']['ønom']:
    se.showcx(cx, condenseType='clause')

{   '__cx__': 'prep_ph',
    'head': {'__cx__': 'cont', 'head': 328456},
    'prep': {'__cx__': 'prep', 'head': 328455}}



We can also access tense contexts this way:

In [89]:
for cx in contexts['mṣʿr-N']['qṭl']:
    se.showcx(cx, condenseType='clause')

{   '__cx__': 'prep_ph',
    'head': {   '__cx__': 'defi_ph',
                'art': {'__cx__': 'art', 'head': 233761},
                'head': {'__cx__': 'cont', 'head': 233762}},
    'prep': {'__cx__': 'prep', 'head': 233760}}



## Identifying Frequently Nominalized Words

One possible way to identify words that are regularly nominalized is to use a measure of statistical association. The standard measure in this project is Fisher's exact test. We apply it below to the `head_collocates` dataset. Any resulting value >1.3 or <-1.3 is considered statistically significant, with positive values indicating positive association and negative values indicating negative association.

In [91]:
head_assoc = apply_fishers(head_collocates)

head_assoc.head()

Unnamed: 0,nom,ønom
ywm-N,159.481461,-159.481461
ʿrb-N,16.872612,-16.872612
mṣʿr-N,-0.0,0.0
lylh-N,2.245085,-2.245085
ʿt-N,1.012334,-1.012334
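The ±1.3 threshold is easiest to read if the score is a signed, log10-transformed p-value, since -log10(0.05) ≈ 1.3. The sketch below works under that assumption. `apply_fishers` is a project helper whose internals are not shown in this notebook, so the function names here (`fisher_two_sided`, `signed_score`) and the toy tables are illustrative only, not the project's implementation.

```python
from math import comb, log10

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum every possible table that is at least as unlikely as the observed one.
    p = sum(prob(x) for x in support if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)

def signed_score(a, b, c, d):
    """-log10(p), made negative when cell a falls below its expected value."""
    p = max(fisher_two_sided(a, b, c, d), 1e-300)  # guard against log10(0)
    expected = (a + b) * (a + c) / (a + b + c + d)
    strength = -log10(p)
    return strength if a >= expected else -strength

print(round(-log10(0.05), 3))             # the 1.3 cut-off corresponds to p = 0.05
print(signed_score(10, 0, 0, 10) > 1.3)   # toy table with strong positive association
print(signed_score(0, 10, 10, 0) < -1.3)  # the mirror image: strong negative association
```

Read this way, a score such as the one for `ʿt-N` above (about 1.01) corresponds to a p-value larger than 0.05 and so falls short of significance in either direction.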


To isolate cases that are positively associated with nominal markers, we can select all cases with a Fisher's score `>1.3`:

In [92]:
nominals = pd.DataFrame(head_assoc['nom'][head_assoc['nom'] > 1.3])

nominals.shape

(12, 1)

Here are the heads in descending order of significance.

In [94]:
nominals.sort_values(by='nom', ascending=False)

Unnamed: 0,nom
ywm-N,159.481461
s̆nh-N,32.0981
ʿrb-N,16.872612
bqr-N2,11.342699
pʿm-N,9.638066
pnh-N,3.326652
ḥds̆-N2,3.279716
ṣhrym-N,3.038424
lylh-N,2.245085
dbr-N,1.794882


We see that `ywm` is most strongly associated with nominalizers, followed by other terms such as `s̆nh` and `ʿrb`.

Below we isolate the particles in the same way.

In [97]:
particles = pd.DataFrame(head_assoc['nom'][head_assoc['nom'] < -1.3])

particles.shape

(23, 1)

In [98]:
particles.sort_values(by='nom')

Unnamed: 0,nom
ʿwd-N,-123.610507
ʾz-P,-42.785489
ʿwlm-N,-39.107762
ʿth-P,-29.690833
kn-P,-17.66352
tmyd-N,-13.028554
mḥr-N,-11.608481
nṣḥ-N,-8.776832
ptʾm-P,-8.070681
mḥrt-N,-8.070681


## Collocation Tests

In [16]:
nominal_collocations = verb_collocates.loc[nominals.index].sum()
particle_collocations = verb_collocates.loc[particles.index].sum()

nominal_ratios = nominal_collocations / nominal_collocations.sum()
particle_ratios = particle_collocations / particle_collocations.sum()

In [30]:
nom_part = pd.concat([nominal_collocations, particle_collocations], axis=1)
nom_part.columns = ['nom', 'ønom']

In [31]:
nom_part

Unnamed: 0,nom,ønom
qṭl,512.0,209.0
yqṭl,468.0,546.0
ptcp,117.0,62.0
impv,57.0,35.0
wyqṭl,478.0,128.0
wqṭl,149.0,42.0


In [32]:
nom_part_assoc = apply_fishers(nom_part)

In [33]:
nom_part_assoc

Unnamed: 0,nom,ønom
qṭl,5.921881,-5.921881
yqṭl,-45.727892,45.727892
ptcp,0.200139,-0.200139
impv,-0.129414,0.129414
wyqṭl,19.141985,-19.141985
wqṭl,4.949011,-4.949011


In [19]:
nominal_ratios

qṭl      0.287479
yqṭl     0.262774
ptcp      0.065693
impv      0.032004
wyqṭl    0.268389
wqṭl     0.083661
dtype: float64

In [20]:
particle_ratios

qṭl      0.204501
yqṭl     0.534247
ptcp      0.060665
impv      0.034247
wyqṭl    0.125245
wqṭl     0.041096
dtype: float64
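As a check on the arithmetic, the ratios above can be reproduced by hand from the counts in `nom_part`. The sketch below uses plain dictionaries rather than pandas; the counts are copied from the `nom_part` output, with the first number of each pair for nominal heads and the second for particle heads.

```python
# Counts copied from the nom_part output: tense -> (nominal heads, particle heads).
nom_part = {
    'qṭl': (512, 209),
    'yqṭl': (468, 546),
    'ptcp': (117, 62),
    'impv': (57, 35),
    'wyqṭl': (478, 128),
    'wqṭl': (149, 42),
}

# Column totals: all collocations with nominal heads, and all with particle heads.
nom_total = sum(nom for nom, _ in nom_part.values())
part_total = sum(part for _, part in nom_part.values())

# Share of each tense within its own column.
nominal_ratios = {t: nom / nom_total for t, (nom, _) in nom_part.items()}
particle_ratios = {t: part / part_total for t, (_, part) in nom_part.items()}

for tense in nom_part:
    print(f'{tense}\t{nominal_ratios[tense]:.6f}\t{particle_ratios[tense]:.6f}')
```

On these counts, *yiqtol* makes up about 26% of the collocations with nominal heads but about 53% of those with particle heads, which agrees with its strongly positive `ønom` score in `nom_part_assoc`.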