*best viewed in [nbviewer](https://nbviewer.jupyter.org/github/CambridgeSemiticsLab/BH_time_collocations/blob/master/results/notebooks/yiqtol_nom_advb.ipynb)*

<center><h1>Yiqtol Collocations with Nominal versus Adverbial Heads</h1></center>
<center><h2 style="font-weight:normal">Cody Kingham</h2></center>
<center><h2><a href="../../docs/sponsors.md"><img height=15% width=15% src="../../docs/images/CambridgeU_BW.png"></a></h2></center>

In [4]:
! echo "last updated:"; date

last updated:
Thu  2 Jan 2020 16:43:45 GMT


## Introduction

In an earlier analysis, I found some evidence that the *yiqtol* verb tends to prefer time adverbial heads that do not nominalize with features such as plural endings or definite articles (see results [here](https://github.com/CambridgeSemiticsLab/BH_time_collocations/blob/master/archive/2019-10-31/analysis/exploratory/construction_clusters.ipynb)). In this notebook, we will ask that question more specifically, namely:

**Does yiqtol show preference, generally, for non-nominal time adverbial heads?**

By "non-nominal" I mean what are often called "particles." Common examples are עתה, אז, עולם, etc.

In asking this question, I do not assume a classic definition of parts of speech, but rather a definition based on a construction grammar approach. The constructional approach to parts-of-speech does not assume universal categories. Categories like "adjective," "adverb," and even "noun" or "verb" are language specific. Not only that, they are not always closed, neatly-defined categories, as words can exist on a continuum.

**In asking the above question about *yiqtol*, we must simultaneously ask the same question about all other verb tenses.**

## Nominal Hypothesis: quantification as a distinguishing marker

Which words should we consider "nominal" and which "particles"?

The working hypothesis of this inquiry is that noun-like words are identified not by an assumed word class (e.g. "noun" versus "adverb"), but by the constructions within which a word regularly appears. Constructions used for indicating quantification are particularly relevant for the noun/particle distinction. These include plural noun endings and definite article constructions, but also count-noun constructions (e.g. "three years").

These markers are predicted to occur with nouns when aligned to a conceptual space, as seen in Croft (2001: 99) for English:

<img height=30% width=30% src="../../docs/images/figures/Croft_2001_POS_map.jpg">

Croft notes two axes along which language encodes certain functions. The y-axis denotes an object-to-action continuum, whereas the x-axis denotes a reference-to-predication continuum.

The words we are seeking to distinguish in this notebook are those which are construed as objects with reference ("nouns", the upper-leftmost quadrant) versus those that are not ("particles"). Why such a clean two-way division? It is because our dataset contains only time adverbials, and amongst those adverbials we primarily see nouns or particles. We do not expect verbs, for instance, since this dataset excludes infinitival adverbs.

### Statistical Association with Quantification

Quantification constructions will not always co-occur with a given lexeme, for instance when the lexeme appears in the singular without any modifier. Thus, we will classify lexemes by their overall statistical tendency. The question is this: is a given lexeme statistically associated with quantification? If so, we label it as nominal. If it is statistically disassociated, we label it as a particle.
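The idea of a signed association score can be sketched in plain Python. This is only an illustration: the notebook's own `apply_fishers` helper is a custom function whose internals are not shown here, so the function names `fisher_two_sided` and `signed_score`, the sign convention, and all counts below are assumptions made for the example.

```python
from math import comb, log10

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # hypergeometric probability of every possible top-left cell value
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    observed = probs[a]
    # sum the tables that are no more likely than the observed one
    p = sum(pr for pr in probs.values() if pr <= observed * (1 + 1e-7))
    return min(p, 1.0)

def signed_score(a, b, c, d):
    """-log10(p), negated when the lexeme has fewer 'nom' cases than expected."""
    p = fisher_two_sided(a, b, c, d)
    expected = (a + b) * (a + c) / (a + b + c + d)
    strength = -log10(p) if p > 0 else float("inf")
    return strength if a >= expected else -strength

# Hypothetical counts: (lexeme nom, lexeme ønom, other nom, other ønom)
particle_like = signed_score(6, 132, 1400, 400)
noun_like = signed_score(130, 8, 1276, 524)
print(particle_like, noun_like)
```

Under this convention a strongly positive score marks a lexeme that attracts quantification, and a strongly negative score marks one that repels it.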

## Nominal versus Particle

Once we have classified the lexemes based on their statistical tendencies, we can use those classes for collocation measurements with yiqtol and the other tenses.

<hr>

<center><h2>Python</h2></center>

## Import Modules and Data

In [13]:
# standard packages
from pathlib import Path
import collections
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# custom package in /tools
from paths import figs
from helpers import Figures, convert2pandas
from tf_tools.load import load_tf
from tf_tools.tokenizers import tokenize_surface
from cx_analysis.load import cxs
from cx_analysis.search import SearchCX
from stats.significance import contingency_table, apply_fishers

TF, API, A = load_tf(silent='deep')
A.displaySetup(condenseType='phrase')
F, E, T, L = A.api.F, A.api.E, A.api.T, A.api.L
se = SearchCX(A)
phrase2cxs = cxs['phrase2cxs']
class2cx = cxs['class2cx']
time_cxs = list(phrase2cxs.values())
sns.set(font_scale=1.5, style='whitegrid')

## Measure Quantification Associations 

We now measure the quantification tendencies of time adverbial heads and build the noun/particle categories.

For each lexeme, we iterate through all of its occurrences in the Hebrew Bible. For each occurrence, we record whether it appears with any of the following constructions:

* plural noun ending
* quantification with a cardinal number or qualitative quantifier (e.g. כל)
* modification with a definite article
* modification with a demonstrative pronoun 

We will *exclude* the following contexts:

* non-verbal clauses
* multi-phrasal time adverbials
* any clause with a non-finite main verb 
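The counting step described above can be sketched with hand-made records. The records, the flag names, and the `tag_occurrence` helper are hypothetical; the notebook derives the equivalent information from Text-Fabric features and construction classifications.

```python
from collections import Counter, defaultdict

# Hypothetical occurrences; each flag says whether a marker is present.
occurrences = [
    {"lex": "JWM/", "plural": True, "quantified": True,
     "definite": False, "demonstrative": False},
    {"lex": "JWM/", "plural": False, "quantified": False,
     "definite": True, "demonstrative": True},
    {"lex": "JWM/", "plural": False, "quantified": False,
     "definite": False, "demonstrative": False},
    {"lex": "<WLM/", "plural": False, "quantified": False,
     "definite": False, "demonstrative": False},
]

MARKERS = ("plural", "quantified", "definite", "demonstrative")

def tag_occurrence(occ):
    """Tag an occurrence 'nom' if any nominal marker is present, else 'ønom'."""
    return "nom" if any(occ[m] for m in MARKERS) else "ønom"

# lexeme -> Counter of 'nom' / 'ønom' tags
head2count = defaultdict(Counter)
for occ in occurrences:
    head2count[occ["lex"]][tag_occurrence(occ)] += 1

for lex, counts in head2count.items():
    print(lex, dict(counts))
```

A single marker is enough to tag an occurrence as `nom`, so the two tags are mutually exclusive and their counts sum to the lexeme's total.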

In [42]:
print('Number of single time adverbials:')
len(class2cx['single'] - class2cx['component'])

Number of single time adverbials:


3823

In [24]:
# def geta(item, attrib, default=None):
#     """Safely retrieve attribute from object

#     Some objects in a CX graph are TF integer
#     nodes, while most are CX objects. In order
#     to safely call attributes on a given position,
#     we need to handle attribute errors when called
#     on an integer.
#     """
#     try:
#         return item.__dict__[attrib]
#     except AttributeError:
#         return default

# def get_head_modi(cx, name, default=None):
#     """Retrieve a modifier on a particular head"""
#     head = list(cx.getsuccroles('head'))[-1]
#     for c in cx.graph:
#         if (geta(c,'name') == name) and (head in c):
#             return c
#     # unsuccessful search
#     return default

In [85]:
head2count = collections.defaultdict(collections.Counter) # count: nom, ønom
head2case = collections.defaultdict(lambda: collections.defaultdict(list))
head2verbcount = collections.defaultdict(collections.Counter)

# relabel 'ptca' as 'ptcp' so both are counted under one participle label
tense_map = {'ptca': 'ptcp'}

for ta in (class2cx['single'] - class2cx['component']):
    head = list(ta.getsuccroles('head'))[-1]
    
    # skip if no verb in clause (check before reading the verb's tense)
    clause = L.u(head, 'clause')[0]
    verb = next((w for w in L.d(clause,'word') if F.pdp.v(w) == 'verb'), None)
    if not verb:
        continue
    tense = tense_map.get(F.vt.v(verb), F.vt.v(verb))
    if F.sp.v(head) == 'verb':
        continue
    if tense not in {
        'impf', 'perf', 
        'wayq', 'weqt',
        'impv', 'ptcp',
    }:
        continue
    
    # make the counts
    nominal_markers = (
        F.nu.v(head) in {'du', 'pl'},
        'quantified' in ta.classification,
        'definite' in ta.classification,
        'demonstrative' in ta.classification,
    )
    
    tag = 'nom' if any(nominal_markers) else 'ønom'
    head2count[F.lex.v(head)][tag] += 1
    head2case[F.lex.v(head)][tag].append(ta)
    head2verbcount[F.lex.v(head)][tense] += 1

In [87]:
headcounts = pd.DataFrame.from_dict(head2count, orient='index').fillna(0)
verbcounts = pd.DataFrame.from_dict(head2verbcount, orient='index').fillna(0)

headcounts.shape

(99, 2)

In [89]:
headcounts.head(20)

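| lexeme | nom | ønom |
|---|---|---|
| `JWM/` | 971.0 | 159.0 |
| `LJLH/` | 74.0 | 34.0 |
| `YHRJM/` | 13.0 | 0.0 |
| `CNH/` | 199.0 | 19.0 |
| `DBR/` | 11.0 | 1.0 |
| `P<M/` | 51.0 | 3.0 |
| `<WLM/` | 6.0 | 132.0 |
| `CB</` | 6.0 | 0.0 |
| `<RB/` | 71.0 | 1.0 |
| `<T/` | 78.0 | 46.0 |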


In [90]:
verbcounts.head(20)

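| lexeme | perf | wayq | impf | ptcp | weqt | impv |
|---|---|---|---|---|---|---|
| `JWM/` | 331.0 | 257.0 | 329.0 | 89.0 | 96.0 | 28.0 |
| `>Z` | 43.0 | 1.0 | 72.0 | 3.0 | 0.0 | 0.0 |
| `LJLH/` | 23.0 | 53.0 | 14.0 | 10.0 | 4.0 | 4.0 |
| `YHRJM/` | 1.0 | 3.0 | 6.0 | 0.0 | 2.0 | 1.0 |
| `CNH/` | 99.0 | 67.0 | 36.0 | 3.0 | 11.0 | 2.0 |
| `<WD/` | 42.0 | 48.0 | 198.0 | 24.0 | 8.0 | 5.0 |
| `MXR/` | 1.0 | 0.0 | 20.0 | 4.0 | 1.0 | 7.0 |
| `DBR/` | 3.0 | 8.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| `P<M/` | 12.0 | 16.0 | 14.0 | 2.0 | 7.0 | 3.0 |
| `<WLM/` | 23.0 | 6.0 | 77.0 | 9.0 | 19.0 | 4.0 |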


In [73]:
# for cx in head2case['MHR[']['ønom']:
#     se.showcx(cx, condenseType='sentence')

In [74]:
headcounts_assoc = apply_fishers(headcounts)

In [75]:
nominals = pd.DataFrame(headcounts_assoc['nom'][headcounts_assoc['nom'] > 1.3])

nominals.shape

(14, 1)

In [81]:
nominals.sort_values(by='nom', ascending=False)

| lexeme | nom |
|---|---|
| `JWM/` | 164.627251 |
| `CNH/` | 33.719059 |
| `<RB/` | 17.656001 |
| `BQR=/` | 11.856963 |
| `P<M/` | 8.553416 |
| `XDC=/` | 4.504992 |
| `YHRJM/` | 3.632188 |
| `PNH/` | 3.632188 |
| `LJLH/` | 2.079186 |
| `DBR/` | 1.796208 |


In [77]:
particles = pd.DataFrame(headcounts_assoc['nom'][headcounts_assoc['nom'] < -1.3])

particles.shape

(23, 1)

In [79]:
particles.sort_values(by='nom')

| lexeme | nom |
|---|---|
| `<WD/` | -132.965951 |
| `>Z` | -42.302346 |
| `<WLM/` | -40.791604 |
| `<TH` | -29.337472 |
| `KN` | -17.478837 |
| `TMJD/` | -14.301896 |
| `MXR/` | -11.489408 |
| `NYX/` | -8.687614 |
| `PT>M` | -7.988826 |
| `MWT/` | -7.63968 |
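The cut-offs of `1.3` and `-1.3` used above are close to `-log10(0.05)`. That reading assumes the scores returned by `apply_fishers` are signed log10-transformed p-values, which this notebook does not state explicitly. Under that assumption, the three-way split can be sketched as follows; the `classify` helper and the `HYPOTHETICAL/` entry are illustrative only, while the other scores are copied from the tables above.

```python
from math import log10

ALPHA = 0.05
print(-log10(ALPHA))  # close to the 1.3 cut-off used in the notebook

THRESHOLD = 1.3  # value used in the cells above

# Scores copied from the tables above, plus one made-up weak score.
scores = {
    "JWM/": 164.627251,
    "LJLH/": 2.079186,
    "DBR/": 1.796208,
    "<WD/": -132.965951,
    "MWT/": -7.63968,
    "HYPOTHETICAL/": 0.4,  # invented to show the unclassified case
}

def classify(score, threshold=THRESHOLD):
    """Label a signed association score as nominal, particle, or neither."""
    if score > threshold:
        return "nominal"
    if score < -threshold:
        return "particle"
    return "unclassified"

labels = {lex: classify(score) for lex, score in scores.items()}
for lex, label in labels.items():
    print(lex, label)
```

Lexemes whose scores fall between the two cut-offs belong to neither group and therefore play no part in the collocation counts.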


## Collocation Tests

In [106]:
nominal_collocations = verbcounts.loc[nominals.index].sum()
particle_collocations = verbcounts.loc[particles.index].sum()

nominal_ratios = nominal_collocations / nominal_collocations.sum()
particle_ratios = particle_collocations / particle_collocations.sum()

In [107]:
nominal_collocations

perf    513.0
wayq    479.0
impf    474.0
ptcp    117.0
weqt    151.0
impv     57.0
dtype: float64

In [108]:
particle_collocations

perf    209.0
wayq    128.0
impf    546.0
ptcp     62.0
weqt     42.0
impv     35.0
dtype: float64

In [109]:
nominal_ratios

perf    0.286432
wayq    0.267448
impf    0.264657
ptcp    0.065327
weqt    0.084310
impv    0.031826
dtype: float64

In [111]:
particle_ratios

perf    0.204501
wayq    0.125245
impf    0.534247
ptcp    0.060665
weqt    0.041096
impv    0.034247
dtype: float64
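One possible way to check whether the difference in *yiqtol* (`impf`) proportions is larger than chance is a 2x2 chi-square test of `impf` versus all other tenses, for particle heads versus nominal heads. The sketch below reuses the counts printed above. The choice of test and the helper names `two_by_two` and `chi_square_2x2` are assumptions of this example, not part of the notebook's own tooling.

```python
from math import erfc, sqrt

# Collocation counts copied from the output cells above.
nominal = {"perf": 513, "wayq": 479, "impf": 474,
           "ptcp": 117, "weqt": 151, "impv": 57}
particle = {"perf": 209, "wayq": 128, "impf": 546,
            "ptcp": 62, "weqt": 42, "impv": 35}

def two_by_two(target, group_a, group_b):
    """Collapse two tense distributions into target-vs-rest counts."""
    a = group_a[target]
    b = sum(group_a.values()) - a
    c = group_b[target]
    d = sum(group_b.values()) - c
    return a, b, c, d

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value, 1 df."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(chi2 / 2))  # chi-square survival function for 1 df
    return chi2, p

a, b, c, d = two_by_two("impf", particle, nominal)
odds_ratio = (a * d) / (b * c)
chi2, p = chi_square_2x2(a, b, c, d)
print(f"odds ratio: {odds_ratio:.2f}, chi2: {chi2:.1f}, p: {p:.3g}")
```

An odds ratio above 1 here means *yiqtol* makes up a larger share of the particle group than of the nominal group. The test treats each clause as an independent observation, which is a simplification for corpus data.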