# Build Head Word | Function Association Scores

I would like to distinguish time phrases headed by nouns that are normally associated with the Time function from those headed by nouns that are not. Queries could then require that a "time noun" heads the phrase, rather than a less strongly associated term, which would yield a more stable dataset. For this purpose, I will create a set of terms that can be included in queries.

To measure association strength properly, a noun's count within a given function must be compared against its counts within every other function.
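As a rough illustration of that comparison, the sketch below builds a 2×2 contingency table for one lexeme and one function from a small, made-up count table and scores it with `scipy.stats.fisher_exact`. The helper `association_score`, the toy counts, and the signed log10 convention (positive when the lexeme occurs more often than expected, negative otherwise) are assumptions for illustration; the notebook's own `apply_fishers` may differ in detail.

```python
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact


def association_score(counts: pd.DataFrame, lexeme: str, function: str) -> float:
    """Signed -log10 Fisher p-value for one lexeme/function pair (illustrative only)."""
    total = counts.values.sum()
    a = counts.loc[lexeme, function]   # lexeme inside the function
    b = counts.loc[lexeme].sum() - a   # lexeme in all other functions
    c = counts[function].sum() - a     # other lexemes inside the function
    d = total - a - b - c              # other lexemes in other functions
    _, p_value = fisher_exact([[a, b], [c, d]])
    expected = (a + b) * (a + c) / total
    with np.errstate(divide='ignore'):  # a p-value of 0 yields inf
        strength = -np.log10(p_value)
    # assumed convention: attraction is positive, repulsion is negative
    return float(strength if a >= expected else -strength)


# hypothetical counts: rows are head lexemes, columns are phrase functions
toy = pd.DataFrame(
    {'Time': [40, 2, 5], 'Subj': [3, 30, 20], 'Objc': [2, 25, 30]},
    index=['JWM/', 'MLK/', 'DBR/'],
)

time_score_jwm = association_score(toy, 'JWM/', 'Time')
time_score_mlk = association_score(toy, 'MLK/', 'Time')
```

In this toy table `JWM/` receives a strongly positive Time score and `MLK/` a strongly negative one, because the test weighs each count against the lexeme's occurrences in every other function.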

An earlier version of this analysis was done in [BH_time_collocations](BH_time_collocations.ipynb) on the SBH corpus.

This data will be exported as TF node features, stored on the nouns themselves. 

In [1]:
from tf.app import use

import collections, random
import pandas as pd
import numpy as np
import scipy.stats as stats
import sys
sys.path.append('..')
from pyscripts.significance import contingency_table, apply_fishers

# load BHSA and heads data
A = use('bhsa', mod='etcbc/heads/tf', hoist=globals())
A.displaySetup(condenseType='clause') # configure Hebrew display

ModuleNotFoundError: No module named 'pyscripts'

# Prepare the Data

In [4]:
A.api.indent(reset=True)
A.api.info('running query...')


time_nheads = A.search('''

phrase function#PtcO|PreS|PreO
    <nhead- word pdp#prep ls#card|ppre

''')

# map function variants to a single label so that their counts are not split unnecessarily
funct_maps = {'PreO': 'Pred', 'PreS': 'Pred', 'PtcO': 'Pred',
              'IntS': 'Intj', 'NCoS': 'NCop','ModS': 'Modi',
              'ExsS': 'Exst'}

# make the counts
functions = collections.defaultdict(collections.Counter)

A.api.info('making counts of features...')
for phrase, head_word in time_nheads:
    function = funct_maps.get(F.function.v(phrase), F.function.v(phrase))
    head_lex = F.lex.v(head_word)
    functions[function][head_lex] += 1
    
functions = pd.DataFrame(functions).fillna(0)

A.api.info('DONE')

  0.00s running query...
  2.68s 242834 results
  2.68s making counts of features...
  3.06s DONE


In [5]:
functions.shape

(7722, 21)

In [6]:
functions.head()

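| | Time | Pred | Subj | Objc | Conj | PreC | Cmpl | Rela | Modi | Adju | ... | Intj | Frnt | Nega | PrcS | Ques | Voct | NCop | PrAd | Exst | EPPr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <B/ | 0.0 | 0.0 | 8.0 | 3.0 | 0.0 | 1.0 | 7.0 | 0.0 | 0.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| <B=/ | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| <B==/ | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| <BC[ | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| <BD/ | 0.0 | 0.0 | 211.0 | 81.0 | 0.0 | 90.0 | 152.0 | 0.0 | 0.0 | 30.0 | ... | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 7.0 | 0.0 | 1.0 | 0.0 | 0.0 |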


In [7]:
functions.sort_values(by='Time', ascending=False).head()

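| | Time | Pred | Subj | Objc | Conj | PreC | Cmpl | Rela | Modi | Adju | ... | Intj | Frnt | Nega | PrcS | Ques | Voct | NCop | PrAd | Exst | EPPr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JWM/ | 1601.0 | 0.0 | 200.0 | 98.0 | 0.0 | 80.0 | 42.0 | 0.0 | 1.0 | 57.0 | ... | 0.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| CNH/ | 420.0 | 0.0 | 40.0 | 22.0 | 0.0 | 145.0 | 14.0 | 0.0 | 0.0 | 31.0 | ... | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 0.0 | 0.0 |
| <TH | 368.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 61.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| <WLM/ | 212.0 | 0.0 | 0.0 | 1.0 | 0.0 | 67.0 | 3.0 | 0.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| LJLH/ | 175.0 | 0.0 | 10.0 | 6.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 3.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |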


## Apply Significance Tests

In [8]:
A.api.indent(reset=True)
A.api.info('applying Fisher\'s exact tests...')
functions = apply_fishers(functions)

A.api.info('DONE.')
print(functions.shape)
functions.head()

  0.00s applying Fisher's exact tests...


  strength = -np.log10(p_value)
  strength = np.log10(p_value)


 4m 35s DONE.
(7722, 21)


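| | Time | Pred | Subj | Objc | Conj | PreC | Cmpl | Rela | Modi | Adju | ... | Intj | Frnt | Nega | PrcS | Ques | Voct | NCop | PrAd | Exst | EPPr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <B/ | 0.0 | -2.769127 | 1.745078 | 0.135391 | -2.05406 | -0.143866 | 2.148593 | 0.0 | 0.0 | 3.547725 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| <B=/ | 0.0 | 0.0 | 1.267672 | 0.0 | 0.0 | 0.663646 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| <B==/ | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.023261 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| <BC[ | 0.0 | 0.628902 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| <BD/ | -4.107645 | -67.202052 | 40.471062 | 2.794753 | -52.562591 | 9.304012 | 30.70711 | -6.26877 | -3.56768 | 1.09119 | ... | -1.565169 | 0.847027 | -5.916329 | 0.0 | -0.898621 | 0.894243 | -0.375272 | 0.332942 | 0.0 | 0.0 |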


## Replace inf

Several terms have a score of infinity. This happens when Fisher's exact test returns a p-value of 0 (the true p-value is too small to represent as a floating-point number), so that taking its log10 yields infinity. We replace `inf` with the maximum finite score in the dataset, and likewise replace `-inf` with the minimum finite score.
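The effect of this capping can be seen on a small, made-up score table. The sketch below is one possible vectorized way to do it, using `DataFrame.where` and `DataFrame.replace`; the frame `scores` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical scores containing both kinds of infinity
scores = pd.DataFrame(
    {'Time': [np.inf, 3.2, -1.0], 'Loca': [-np.inf, 250.5, 0.0]},
    index=['JWM/', 'LJLH/', 'BJT/'],
)

# mask out the infinities before looking for the extremes
finite = scores.where(np.isfinite(scores))
ds_max = finite.max().max()   # largest finite score in the whole table
ds_min = finite.min().min()   # smallest finite score in the whole table

# cap inf at the finite maximum and -inf at the finite minimum
capped = scores.replace([np.inf, -np.inf], [ds_max, ds_min])
```

Capping keeps the relative ranking of the strongest associations at the top and bottom of the scale while making the values usable as ordinary numbers.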

In [9]:
ds_max = functions[functions != np.inf].max().max()
ds_min = functions[functions != -np.inf].min().min()

In [10]:
# replace the infinities in one vectorized step; chained indexing such as
# functions[funct][lex] = ... is not guaranteed to write back to the DataFrame
functions = functions.replace([np.inf, -np.inf], [ds_max, ds_min])

## Explore Dataset

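A score of about 1.3 is the threshold for statistical significance, since the scores are log10-transformed p-values from Fisher's exact test and -log10(0.05) ≈ 1.3. Scores above 1.3 indicate significant attraction; scores below -1.3 indicate significant repulsion.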
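The relation between the threshold and the conventional p < 0.05 cutoff can be checked directly. In the short sketch below, `ALPHA` and the helper `score_to_p` are names introduced here for illustration only.

```python
import math

ALPHA = 0.05                      # conventional significance level (assumed here)
threshold = -math.log10(ALPHA)    # approximately 1.301


def score_to_p(score: float) -> float:
    """Convert a signed log10 score back to the p-value it was derived from."""
    return 10 ** -abs(score)


p_for_score_2 = score_to_p(2.0)   # a score of 2 corresponds to p = 0.01
```

The sign of a score only records the direction of the association, so `score_to_p(-2.0)` gives the same p-value as `score_to_p(2.0)`.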

### Associations with Time

In [11]:
functions.Time[functions.Time > 1.3].sort_values(ascending=False)

CNH/       318.859989
JWM/       318.859989
<TH        318.859989
<WLM/      305.933030
LJLH/      277.925214
<T/        230.846877
BQR=/      228.450155
<RB/       143.245480
XDC=/      135.538720
>DJN       100.713351
>Z          99.648700
JWMM        70.723632
MXR/        65.973781
<D/         47.404521
MXRT/       44.077848
NYX/        39.279495
TMJD/       38.270440
MTJ         32.476964
DWR/        30.779970
MWT/        27.310535
TMWL/       27.276753
YHRJM/      24.579712
KN          22.635229
MW<D/       19.477118
R>CWN/      17.063901
<LM/        16.869511
>N          14.347832
K<N         13.052701
N<WRJM/     12.625264
>TMWL/       9.747058
              ...    
HNH==        3.817462
DNH          3.685486
XJJM/        3.532438
MBWL/        3.401885
>XRWN/       3.300293
R>CJT/       3.170443
K<NT         3.056051
TQWPH/       3.056051
BVN/         2.982834
MLKWT/       2.854706
QYJR/        2.830419
<FJRJ/       2.793066
DJ/          2.774199
LJL/         2.372071
QDM/      

### Associations with Loca (location)

In [12]:
functions.Loca[functions.Loca.round() > 1.3].sort_values(ascending=False)

CM             318.859989
>RY/           177.949536
MDBR/          102.900317
HR/             87.047975
MQWM/           58.429699
BJT/            56.358666
JRWCLM/         47.692306
XWY/            44.525367
<JR/            38.018648
FDH/            31.177761
CMJM/           30.099944
QRB/            28.849811
C<R/            28.051077
PTX/            27.422209
GBWL/           25.845651
CMC/            25.181131
PH              24.498467
JM/             21.898664
<RBH/           19.991434
>HL/            18.928615
RXB==/          18.650725
YJWN==/         18.630934
JRDN/           17.708963
XBRWN=/         17.291705
>DMH/           16.951844
CMRWN/          16.628851
XRB===/         16.570194
MYRJM/          16.512203
BMH/            15.784814
P>H/            13.822764
                  ...    
XDJD/            1.655976
S<JP=/           1.655976
>SP=/            1.655976
B<L_YPN/         1.655976
MXNH_DN/         1.655976
CQT/             1.655976
GTJM/            1.655976
JBCT/       

### Associations with Objc (direct object)

In [13]:
functions.Objc[functions.Objc.round() > 1.3].sort_values(ascending=False).head(25).round()

DBR/     185.0
BRJT/    109.0
LXM/     107.0
BGD/     100.0
MCPV/     81.0
MGRC/     77.0
CM/       76.0
NPC/      74.0
<LH/      72.0
KSP/      71.0
MH        70.0
ZHB/      64.0
DM/       62.0
MYWH/     61.0
KLJ/      60.0
R<H/      51.0
QWL/      48.0
BJT/      46.0
MZBX/     45.0
XN/       45.0
PR/       45.0
MJM/      45.0
<Y/       44.0
MNXH/     44.0
KL/       42.0
Name: Objc, dtype: float64

### Associations with Cmpl (complement)

In [14]:
functions.Cmpl[functions.Cmpl.round() > 1.3].sort_values(ascending=False).head(25).round()

>RY/       319.0
BJT/       250.0
PNH/       212.0
JD/        211.0
JRWCLM/    167.0
CM         139.0
MYRJM/     105.0
MLK/       101.0
<JR/       100.0
<M/        100.0
JHWH/       92.0
MQWM/       90.0
HR/         86.0
<JN/        82.0
MCH=/       80.0
JFR>L/      75.0
>HL/        62.0
>B/         60.0
SPR/        58.0
BBL/        53.0
MZBX/       50.0
>C/         47.0
>DMH/       43.0
MXNH/       42.0
PR<H/       42.0
Name: Cmpl, dtype: float64

### Associations with Subj (subject)

In [15]:
functions.Subj[functions.Subj.round() > 1.3].sort_values(ascending=False).head(25).round()

JHWH/     319.0
>JC/      319.0
>NJ       319.0
>TH       319.0
HW>       319.0
HJ>       280.0
>NKJ      279.0
BN/       254.0
>TM       214.0
MLK/      211.0
HMH       196.0
HM        177.0
DWD==/    173.0
>LH       165.0
MJ        139.0
KHN/      129.0
MCH=/     106.0
>DNJ/      96.0
>LHJM/     96.0
<M/        93.0
>NXNW      93.0
C>WL=/     81.0
ZH         80.0
JHWC</     65.0
KL/        55.0
Name: Subj, dtype: float64

## Export TF Node Features

We export the scores as TF features. Feature names follow the template `<Function>Assoc`, where `<Function>` is the ETCBC function abbreviation, e.g. `TimeAssoc`.

Every function in the dataset receives a separate feature, stored on the word itself. The value is the association score rounded to an integer. This allows for queries such as the following:

```
phrase function=Time
    <head- word TimeAssoc>50
```
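Because the exported values are integers, the 1.3 threshold cannot be queried exactly. The sketch below shows how a feature name and value could be derived from a function label and a raw score; `feature_name` and `feature_value` are hypothetical helpers, not part of the notebook or of Text-Fabric.

```python
def feature_name(function: str) -> str:
    """Build the feature name for a function label, e.g. 'Time' -> 'TimeAssoc'."""
    return f'{function}Assoc'


def feature_value(score: float) -> int:
    """Round a raw association score to the integer stored on the word node."""
    return int(round(score))


name = feature_name('Time')
strong = feature_value(318.859989)    # 319
marginal = feature_value(1.655976)    # 2
borderline = feature_value(1.3)       # 1
```

Under this rounding, a query condition such as `TimeAssoc>1` selects words whose raw score is roughly 1.5 or higher, which is slightly stricter than the 1.3 threshold.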

### Build Feature Dict

#### Map lexeme strings to word nodes

In [16]:
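# index lex nodes by lexeme string in a single pass; when a string occurs on
# more than one lex node, the most frequent node is visited last and wins
lexnode_by_string = {}
for lexnode in sorted(F.otype.s('lex'), key=lambda n: (F.freq_lex.v(n), n)):
    lexnode_by_string[F.lex.v(lexnode)] = lexnode

# map each lexeme string in the dataset to the word nodes of its lex node
lex2nodes = {lexeme: L.d(lexnode_by_string[lexeme], 'word')
             for lexeme in functions.index}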

#### Map scores to word nodes

In [35]:
nodeFeatures = collections.defaultdict(dict)  # feature name -> {word node: value}

for function in functions:
    for lexeme in functions.index:
        
        assoc_score = round(functions[function][lexeme])    
        for wn in lex2nodes[lexeme]:
            nodeFeatures[f'{function}Assoc'][wn] = int(assoc_score)
            
for lexeme in functions.index:
    top_assoc = functions.loc[lexeme].sort_values(ascending=False).index[0]
    for wn in lex2nodes[lexeme]:
        nodeFeatures['topAssoc'][wn] = top_assoc

In [36]:
nodeFeatures['TimeAssoc'][2]

3

In [37]:
nodeFeatures['topAssoc'][2]

'Objc'

### Export

In [46]:
from tf.fabric import Fabric

# metadata needed to write the features
meta = {
    
'': {'created_by': 'Cody Kingham',
     'coreData': 'BHSA',
     'coreVersion': 'c',
     'source': 'see the creation notebook in https://github.com/CambridgeSemiticsLab/BH_time_collocations'},
'topAssoc': {'valueType':'str',
             'description':'top associated function to this word',
             'interpreting scores':'score > 1.3 is significantly attracted; score < -1.3 is significantly repelled'}
}
    
for feature in nodeFeatures:
    if feature == 'topAssoc': continue
    meta[feature] = {'valueType':'int', 
                     'interpreting scores':'score > 1.3 is significantly attracted; score < -1.3 is significantly repelled'}

TF = Fabric(locations='../../data/funct_associations/', silent=True)

In [47]:
TF.save(nodeFeatures=nodeFeatures, metaData=meta)

   |     0.54s T AdjuAssoc            to /Users/cody/github/csl/time_collocations/analysis/../data/funct_associations
   |     0.52s T CmplAssoc            to /Users/cody/github/csl/time_collocations/analysis/../data/funct_associations
   |     0.51s T ConjAssoc            to /Users/cody/github/csl/time_collocations/analysis/../data/funct_associations
   |     0.54s T EPPrAssoc            to /Users/cody/github/csl/time_collocations/analysis/../data/funct_associations
   |     0.49s T ExstAssoc            to /Users/cody/github/csl/time_collocations/analysis/../data/funct_associations
   |     0.51s T FrntAssoc            to /Users/cody/github/csl/time_collocations/analysis/../data/funct_associations
   |     0.56s T IntjAssoc            to /Users/cody/github/csl/time_collocations/analysis/../data/funct_associations
   |     0.54s T LocaAssoc            to /Users/cody/github/csl/time_collocations/analysis/../data/funct_associations
   |     0.56s T ModiAssoc            to /Users/cody/git

True

## Small Test

Let's find time phrases whose head word has a negative Time association score (`TimeAssoc<0`). Note that only scores below -1.3 count as significantly repelled.

In [3]:
# load BHSA, heads data, and association data
from tf.app import use


A = use('bhsa', mod='etcbc/heads/tf,CambridgeSemiticsLab/BH_time_collocations/tf', hoist=globals())
A.displaySetup(condenseType='clause') # configure Hebrew display

TF app is up-to-date.
Using annotation/app-bhsa commit d3cf8f0c2ab5d690a0fda14ea31c33da5c5c8483 (=latest)
  in /Users/cody/text-fabric-data/__apps__/bhsa.
Using etcbc/bhsa/tf - c rv1.6 in /Users/cody/text-fabric-data
Using etcbc/phono/tf - c r1.2 in /Users/cody/text-fabric-data
Using etcbc/parallels/tf - c r1.2 in /Users/cody/text-fabric-data
Using etcbc/heads/tf - c rv.1.3 in /Users/cody/text-fabric-data
	downloading CambridgeSemiticsLab/BH_time_collocations - c rv1.1
	from https://github.com/CambridgeSemiticsLab/BH_time_collocations/releases/download/v1.1/tf-c.zip ...
	unzipping ...
	saving CambridgeSemiticsLab/BH_time_collocations - c rv1.1
	saved CambridgeSemiticsLab/BH_time_collocations - c rv1.1
Using CambridgeSemiticsLab/BH_time_collocations/tf - c rv1.1 (=latest) in /Users/cody/text-fabric-data


In [6]:
rare_use = A.search('''

phrase function=Time
    <nhead- word TimeAssoc<0

''')

  0.95s 32 results


In [7]:
A.show(rare_use, condenseType='clause')



**results** *1–32* (rendered clause displays omitted)

