# Build Head Word | Function Association Scores

I would like to distinguish time phrases headed by nouns that are normally associated with the time function from those headed by nouns that are not. Queries could then require that a "time noun" heads the phrase, rather than a less-associated term, which should yield a more stable dataset. For this purpose, I will create a set of terms that can be included in queries.

To measure associational strength properly, a noun's count with a given function must be compared against its counts with every other function.
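
As a minimal sketch of that comparison, the block below builds the 2x2 table for one lexeme/function pair from a table of counts. The counts are toy values with English placeholder lexemes, not corpus data, and the helper name `two_by_two` is hypothetical; it is not the `contingency_table` function imported from `pyscripts.significance`, whose internals are not shown in this notebook.

```python
from collections import Counter

# Toy counts (NOT real corpus figures): function -> lexeme -> frequency
counts = {
    'Time': Counter({'day': 8, 'house': 1}),
    'Loca': Counter({'day': 1, 'house': 6}),
    'Subj': Counter({'day': 1, 'house': 3}),
}

def two_by_two(counts, lexeme, function):
    """Return the cells (a, b, c, d) for one lexeme/function pair."""
    total = sum(sum(c.values()) for c in counts.values())
    lex_total = sum(c[lexeme] for c in counts.values())
    funct_total = sum(counts[function].values())
    a = counts[function][lexeme]   # lexeme with this function
    b = lex_total - a              # lexeme with any other function
    c = funct_total - a            # other lexemes with this function
    d = total - a - b - c          # other lexemes with other functions
    return a, b, c, d

table = two_by_two(counts, 'day', 'Time')
print(table)
```

The four cells are what a test such as Fisher's exact test takes as input: the more `a` exceeds what the row and column totals predict, the stronger the attraction between the lexeme and the function.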

An earlier version of this analysis was done in [BH_time_collocations](BH_time_collocations.ipynb) on the SBH corpus.

This data will be exported as TF node features, stored on the nouns themselves. 

In [1]:
from tf.app import use

import collections, random
import pandas as pd
import numpy as np
import scipy.stats as stats
from pyscripts.significance import contingency_table, apply_fishers


# load BHSA and heads data
A = use('bhsa', mod='etcbc/heads/tf', hoist=globals())
A.displaySetup(condenseType='clause') # configure Hebrew display

TF app is up-to-date.
Using annotation/app-bhsa commit 7f353d587f4befb6efe1742831e28f301d2b3cea (=latest)
  in /Users/cody/text-fabric-data/__apps__/bhsa.
Using etcbc/bhsa/tf - c rv1.6 in /Users/cody/text-fabric-data
Using etcbc/phono/tf - c r1.2 in /Users/cody/text-fabric-data
Using etcbc/parallels/tf - c r1.2 in /Users/cody/text-fabric-data
Using etcbc/heads/tf - c rv.1.11 in /Users/cody/text-fabric-data


**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="provenance of BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis">BHSA</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Writing/Hebrew" title="('Hebrew characters and transcriptions',)">Character table</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="BHSA feature documentation">Feature docs</a> <a target="_blank" href="https://github.com/annotation/app-bhsa" title="bhsa API documentation">bhsa API</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Api/Fabric/" title="text-fabric-api">Text-Fabric API 7.4.5</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Use/Search/" title="Search Templates Introduction and Reference">Search Reference</a>

# Prepare the Data

In [2]:
A.api.indent(reset=True)
A.api.info('running query...')


time_nheads = A.search('''

phrase function#PtcO|PreS|PreO
    <nhead- word pdp#prep ls#card|ppre

''')

# map function variants to a shared label to prevent unnecessary splitting of counts
funct_maps = {'PreO': 'Pred', 'PreS': 'Pred', 'PtcO': 'Pred',
              'IntS': 'Intj', 'NCoS': 'NCop', 'ModS': 'Modi',
              'ExsS': 'Exst'}

# make the counts: function -> head lexeme -> frequency
functions = collections.defaultdict(collections.Counter)

A.api.info('making counts of features...')
for phrase, head_word in time_nheads:
    function = funct_maps.get(F.function.v(phrase), F.function.v(phrase))
    head_lex = F.lex.v(head_word)
    functions[function][head_lex] += 1
    
functions = pd.DataFrame(functions).fillna(0)

A.api.info('DONE')

  0.00s running query...
  2.51s 242800 results
  2.51s making counts of features...
  2.89s DONE


In [3]:
functions.shape

(7720, 21)

In [4]:
functions.head()

lex,Time,Pred,Subj,Objc,Conj,PreC,Cmpl,Rela,Modi,Adju,...,Intj,Frnt,Nega,PrcS,Ques,Voct,NCop,PrAd,Exst,EPPr
<B/,0.0,0.0,8.0,3.0,0.0,1.0,7.0,0.0,0.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
<B=/,0.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
<B==/,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
<BC[,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
<BD/,0.0,0.0,211.0,81.0,0.0,90.0,152.0,0.0,0.0,30.0,...,0.0,6.0,0.0,0.0,0.0,7.0,0.0,1.0,0.0,0.0


In [5]:
functions.sort_values(by='Time', ascending=False).head()

lex,Time,Pred,Subj,Objc,Conj,PreC,Cmpl,Rela,Modi,Adju,...,Intj,Frnt,Nega,PrcS,Ques,Voct,NCop,PrAd,Exst,EPPr
JWM/,1595.0,0.0,200.0,98.0,0.0,79.0,42.0,0.0,1.0,56.0,...,0.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CNH/,420.0,0.0,38.0,22.0,0.0,79.0,11.0,0.0,0.0,9.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
<TH,368.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,61.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
<WLM/,212.0,0.0,0.0,1.0,0.0,67.0,3.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LJLH/,174.0,0.0,10.0,6.0,0.0,2.0,3.0,0.0,0.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


## Apply Significance Tests

In [6]:
A.api.indent(reset=True)
A.api.info('applying Fisher\'s exact tests...')
functions = apply_fishers(functions)

A.api.info('DONE.')
print(functions.shape)
functions.head()

  0.00s applying Fisher's exact tests...


  strength = -np.log10(p_value)
  strength = np.log10(p_value)


 4m 40s DONE.
(7720, 21)


lex,Time,Pred,Subj,Objc,Conj,PreC,Cmpl,Rela,Modi,Adju,...,Intj,Frnt,Nega,PrcS,Ques,Voct,NCop,PrAd,Exst,EPPr
<B/,0.0,-2.769282,1.744949,0.135411,-2.054064,-0.143865,2.148571,0.0,0.0,3.547652,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
<B=/,0.0,0.0,1.267629,0.0,0.0,0.663695,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
<B==/,0.0,0.0,0.0,0.0,0.0,0.0,1.023257,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
<BC[,0.0,0.628841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
<BD/,-4.108957,-67.205475,40.467626,2.8894,-52.561667,9.305844,30.706668,-6.268334,-3.567391,1.091142,...,-1.565058,0.847437,-6.129318,0.0,-0.898575,0.89408,-0.375254,0.331707,0.0,0.0
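
The signed scores in this table can be read as a log10-transformed p-value whose sign marks attraction or repulsion. The sketch below illustrates one way such a score could be computed. It is an assumption about the general technique, not the code of `apply_fishers`; the function names `fisher_p` and `signed_strength` are hypothetical, and the test is written out in plain Python so the arithmetic is visible (`scipy.stats.fisher_exact` computes the same two-sided p-value).

```python
from math import comb, log10

def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # sum every possible table that is no more likely than the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)

def signed_strength(a, b, c, d):
    """-log10(p) when a is above expectation, +log10(p) when below."""
    p = fisher_p(a, b, c, d)
    strength = float('inf') if p == 0 else -log10(p)
    expected = (a + b) * (a + c) / (a + b + c + d)
    return strength if a >= expected else -strength

attracted = signed_strength(8, 2, 2, 8)   # a well above its expected value of 5
repelled = signed_strength(2, 8, 8, 2)    # a well below its expected value of 5
print(round(attracted, 3), round(repelled, 3))
```

Under this scheme a large positive value means the lexeme occurs with the function far more often than its totals predict, and a large negative value means far less often.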


## Replace inf

Several terms have a significance score of infinity. This happens when Fisher's exact test returns a p-value of 0 (the true p-value is too small to be represented as a float), because taking the log10 of 0 yields an infinite value. We replace `inf` with the maximum finite score in the dataset, and likewise replace `-inf` with the minimum finite score.

In [17]:
ds_max = functions[functions != np.inf].max().max()
ds_min = functions[functions != -np.inf].min().min()

In [18]:
# swap infinite scores for the finite extremes in one step
# (avoids assigning through chained indexing, which may not write back to the frame)
functions = functions.replace([np.inf, -np.inf], [ds_max, ds_min])
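
To illustrate where the infinite scores come from, the following sketch uses a toy frame (placeholder lexemes and values, not corpus data). It shows a p-value underflowing to 0, the resulting `inf` score, and the replacement by the finite extremes. All variable names here are illustrative.

```python
import numpy as np
import pandas as pd

# the true value, 1e-400, is below the smallest float and underflows to 0.0
p_value = 1e-200 * 1e-200

with np.errstate(divide='ignore'):      # silence the divide-by-zero warning
    strength = -np.log10(p_value)       # inf

toy = pd.DataFrame(
    {'Time': [strength, 12.5], 'Loca': [-strength, -3.0]},
    index=['lex_a', 'lex_b'],
)

# find the finite extremes, then substitute them for the infinite scores
finite = toy.replace([np.inf, -np.inf], np.nan)
toy_max, toy_min = finite.max().max(), finite.min().min()
clipped = toy.replace([np.inf, -np.inf], [toy_max, toy_min])
print(clipped)
```

One consequence of this choice is that all clipped terms share the same score, so their relative ranking at the top of a sorted list is no longer informative.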

## Explore Dataset

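A score of 1.3 is the approximate threshold for statistical significance when using log10-transformed p-values from Fisher's exact test, since -log10(0.05) ≈ 1.3. Scores above 1.3 indicate significant attraction; scores below -1.3 indicate significant repulsion.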
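
The threshold follows directly from the conventional significance level of 0.05. A short sketch of the conversion in both directions (the helper name `to_p_value` is hypothetical):

```python
from math import log10

alpha = 0.05
threshold = -log10(alpha)      # roughly 1.301

def to_p_value(score):
    """Recover the p-value behind a signed association score."""
    return 10 ** -abs(score)

print(round(threshold, 3))
print(to_p_value(-2))          # a score of -2 corresponds to p = 0.01
```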

### Associations with Time

In [21]:
functions.Time[functions.Time > 1.3].sort_values(ascending=False)

CNH/       318.824500
JWM/       318.824500
<TH        318.824500
<WLM/      306.145073
LJLH/      276.377016
<T/        231.018372
BQR=/      215.136281
<RB/       143.335363
XDC=/      135.634145
>DJN       100.769564
>Z          99.725797
JWMM        69.035187
MXR/        66.012120
<D/         47.435813
MXRT/       44.104323
NYX/        39.306791
TMJD/       38.302253
MTJ         32.501222
DWR/        30.813646
MWT/        27.339613
TMWL/       27.294335
YHRJM/      24.595344
KN          22.679062
MW<D/       19.496049
R>CWN/      17.082539
<LM/        16.880249
>N          14.361132
K<N         13.061461
N<WRJM/     12.635814
>TMWL/       9.752922
              ...    
DNH          3.690004
XJJM/        3.538312
>CMRT/       3.530119
MBWL/        3.404746
>XRWN/       3.303967
R>CJT/       3.174836
TQWPH/       3.057998
K<NT         3.057998
BVN/         2.987173
QYJR/        2.834001
<FJRJ/       2.795865
DJ/          2.777767
LJL/         2.373985
QDM/         2.353510
RBJ<J/    

### Associations with Loca (location)

In [22]:
functions.Loca[functions.Loca.round() > 1.3].sort_values(ascending=False)

CM             318.824500
>RY/           177.935486
MDBR/          102.895298
HR/             87.042544
MQWM/           58.425871
BJT/            56.352554
JRWCLM/         47.688628
XWY/            44.522941
<JR/            38.048015
FDH/            31.175681
CMJM/           30.097724
QRB/            28.847802
C<R/            28.048977
PTX/            27.420517
GBWL/           25.843820
CMC/            25.179663
PH              24.497218
JM/             21.896972
<RBH/           19.990478
>HL/            18.927102
RXB==/          18.649830
YJWN==/         18.629822
JRDN/           17.707755
XBRWN=/         17.290814
>DMH/           16.950546
CMRWN/          16.627857
XRB===/         16.569589
MYRJM/          16.510681
BMH/            15.783825
P>H/            13.821945
                  ...    
<JN_DWR/         1.655916
<JH/             1.655916
S<JP=/           1.655916
>SP=/            1.655916
BJT_HJCMWT/      1.655916
B<L_YPN/         1.655916
CQT/             1.655916
GLB</       

### Associations with Objc (direct object)

In [23]:
functions.Objc[functions.Objc.round() > 1.3].sort_values(ascending=False).head(25).round()

DBR/     185.0
BRJT/    109.0
LXM/     107.0
BGD/     100.0
MCPV/     81.0
MGRC/     77.0
CM/       76.0
NPC/      74.0
<LH/      72.0
KSP/      71.0
MH        70.0
ZHB/      64.0
DM/       62.0
MYWH/     61.0
KLJ/      59.0
R<H/      51.0
QWL/      48.0
BJT/      46.0
MZBX/     45.0
XN/       45.0
PR/       45.0
MJM/      45.0
<Y/       44.0
MNXH/     44.0
KL/       42.0
Name: Objc, dtype: float64

### Associations with Cmpl (complement)

In [24]:
functions.Cmpl[functions.Cmpl.round() > 1.3].sort_values(ascending=False).head(25).round()

>RY/       319.0
BJT/       250.0
PNH/       212.0
JD/        211.0
JRWCLM/    167.0
CM         139.0
MYRJM/     105.0
MLK/       101.0
<JR/       100.0
<M/        100.0
JHWH/       92.0
MQWM/       90.0
HR/         86.0
<JN/        82.0
MCH=/       80.0
JFR>L/      75.0
>HL/        62.0
>B/         60.0
SPR/        58.0
BBL/        53.0
MZBX/       50.0
>C/         47.0
>DMH/       43.0
MXNH/       42.0
PR<H/       42.0
Name: Cmpl, dtype: float64

### Associations with Subj (subject)

In [25]:
functions.Subj[functions.Subj.round() > 1.3].sort_values(ascending=False).head(25).round()

HW>       319.0
>TH       319.0
>NJ       319.0
>JC/      319.0
JHWH/     319.0
HJ>       280.0
>NKJ      279.0
BN/       239.0
>TM       214.0
MLK/      211.0
HMH       196.0
HM        177.0
DWD==/    173.0
>LH       165.0
MJ        139.0
KHN/      129.0
MCH=/     106.0
>DNJ/      96.0
>LHJM/     96.0
<M/        93.0
>NXNW      93.0
C>WL=/     81.0
ZH         80.0
JHWC</     65.0
KL/        55.0
Name: Subj, dtype: float64

## Export TF Node Features

We export one TF feature per function. Feature names follow the template `<Function>Assoc`, based on the ETCBC function abbreviations, e.g. `TimeAssoc`.

Each feature is stored on the word nodes themselves, and its value is the association score rounded to an integer. This allows for queries such as the following:

```
phrase function=Time
    <head- word TimeAssoc>50
```
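
The naming template and the integer rounding can be sketched with toy data. The scores, lexeme labels, and node numbers below are hypothetical placeholders, not values from the dataset.

```python
# toy scores: function -> lexeme -> association score (placeholder values)
toy_scores = {'Time': {'lex_a': 47.4, 'lex_b': 1.4, 'lex_c': -2.6}}
# toy mapping from lexeme to word nodes (hypothetical node numbers)
toy_nodes = {'lex_a': [101, 102], 'lex_b': [103], 'lex_c': [104]}

node_features = {}
for function, scores in toy_scores.items():
    feature = f'{function}Assoc'            # e.g. 'TimeAssoc'
    node_features[feature] = {
        node: int(round(score))             # stored as an integer
        for lexeme, score in scores.items()
        for node in toy_nodes[lexeme]
    }

print(node_features)
```

Note that rounding coarsens the scale: in this toy example `lex_b` scores 1.4, which is above the 1.3 threshold, but is stored as 1. Queries on the integer features therefore cannot separate scores just above the threshold from scores just below it.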

### Build Feature Dict

#### Map lexeme strings to word nodes

In [39]:
lex2nodes = {}

for lexeme in functions.index:
    # a lexeme string may match more than one lex node; keep the most frequent one
    lexnode = max((F.freq_lex.v(lex), lex) for lex in F.otype.s('lex') if F.lex.v(lex) == lexeme)[-1]
    # collect all word nodes contained in that lex node
    lex2nodes[lexeme] = L.d(lexnode, 'word')
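
The cell above rescans every lex node once per lexeme. As a hedged alternative, the same choice (the most frequent lex node per lexeme string) can be made in a single pass. The sketch below uses toy tuples in place of Text-Fabric nodes; with the real data, the lexeme string and frequency would come from `F.lex.v` and `F.freq_lex.v` as in the cell above.

```python
# toy stand-ins for lex nodes: (node, lexeme string, frequency); all hypothetical
toy_lex_nodes = [(1, 'lex_a', 120), (2, 'lex_a', 7), (3, 'lex_b', 40)]

best = {}  # lexeme string -> (frequency, node) of the best candidate so far
for node, lexeme, freq in toy_lex_nodes:
    if lexeme not in best or (freq, node) > best[lexeme]:
        best[lexeme] = (freq, node)

# keep only the winning node for each lexeme string
lex2node = {lexeme: node for lexeme, (freq, node) in best.items()}
print(lex2node)
```

Comparing `(frequency, node)` tuples reproduces the tie-breaking of the sorted-tuple approach: the higher node number wins when frequencies are equal.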

#### Map scores to word nodes

In [61]:
nodeFeatures = collections.defaultdict(dict)

for function in functions:
    for lexeme in functions.index:
        
        assoc_score = round(functions[function][lexeme])

        for wn in lex2nodes[lexeme]:
            nodeFeatures[f'{function}Assoc'][wn] = int(assoc_score)

In [62]:
nodeFeatures['TimeAssoc'][2]

3

### Export

In [76]:
from tf.fabric import Fabric

# metadata needed to write the features
meta = {
    
'': {'created_by': 'Cody Kingham',
     'coreData': 'BHSA',
     'coreVersion': 'c',
     'source': 'see the creation notebook in https://github.com/CambridgeSemiticsLab/BH_time_collocations',}
}
    
for feature in nodeFeatures:
    meta[feature] = {'valueType':'int', 
                     'interpreting scores':'score > 1.3 is significantly attracted; score < -1.3 is significantly repelled'}

TF = Fabric(locations='../tf/c', silent=True)

In [77]:
TF.save(nodeFeatures=nodeFeatures, metaData=meta)

   |     0.48s T AdjuAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.46s T CmplAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.44s T ConjAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.42s T EPPrAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.45s T ExstAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.52s T FrntAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.46s T IntjAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.45s T LocaAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.44s T ModiAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.44s T NCopAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.45s T NegaAssoc            to /Users/cody/github/csl/time_collocations/tf/c
   |     0.44s T ObjcAssoc            to /U

True

## Small Test

Let's find time phrases whose head word is repelled from the Time function, i.e. has a negative `TimeAssoc` score.

In [1]:
# load BHSA, heads data, and association data
from tf.app import use


A = use('bhsa', mod='etcbc/heads/tf,CambridgeSemiticsLab/BH_time_collocations/tf', hoist=globals())
A.displaySetup(condenseType='clause') # configure Hebrew display

TF app is up-to-date.
Using annotation/app-bhsa commit 7f353d587f4befb6efe1742831e28f301d2b3cea (=latest)
  in /Users/cody/text-fabric-data/__apps__/bhsa.
Using etcbc/bhsa/tf - c rv1.6 in /Users/cody/text-fabric-data
Using etcbc/phono/tf - c r1.2 in /Users/cody/text-fabric-data
Using etcbc/parallels/tf - c r1.2 in /Users/cody/text-fabric-data
Using etcbc/heads/tf - c rv.1.11 in /Users/cody/text-fabric-data
Using CambridgeSemiticsLab/BH_time_collocations/tf - c rv1.1 in /Users/cody/text-fabric-data


**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="provenance of BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis">BHSA</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Writing/Hebrew" title="('Hebrew characters and transcriptions',)">Character table</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="BHSA feature documentation">Feature docs</a> <a target="_blank" href="https://github.com/annotation/app-bhsa" title="bhsa API documentation">bhsa API</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Api/Fabric/" title="text-fabric-api">Text-Fabric API 7.4.6</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Use/Search/" title="Search Templates Introduction and Reference">Search Reference</a>

In [6]:
rare_use = A.search('''

phrase function=Time
    <nhead- word TimeAssoc<0

''')

  0.95s 32 results


In [7]:
A.show(rare_use, condenseType='clause')


