# Semantic Vector Space for Coordinate Parsing


<a href="https://www.ames.cam.ac.uk"><img src="../../docs/images/CambridgeU_BW.png" height="200px" width="200px" align="left"></a>

In [10]:
!echo "last updated"; date

last updated
Sat 15 Feb 2020 20:29:56 GMT


## Brief

In this notebook I want to construct a simple semantic vector space using a bag-of-words approach. The space will then be tested on several tasks for selecting coordinate pairs. I am interested in the possibility of integrating this ability into the construction parser.

I will only build the space for content words within time adverbial phrases. 

<hr>

# Python

In [53]:
import sys
import collections
from datetime import datetime
from positions import PositionsTF
from stats.significance import apply_fishers, contingency_table
from paths import semvector, tf_data
from tf_tools.load import load_tf
sys.path.append('../cxs')
from word_grammar import Words

from sklearn.metrics.pairwise import pairwise_distances
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from tf.app import use
from tf.fabric import Fabric

# load custom BHSA data + heads
TF, API, A = load_tf()
F, E, T, L = A.api.F, A.api.E, A.api.T, A.api.L # shortform TF methods
A.displaySetup(condenseType='phrase', extraFeatures='lex')

This is Text-Fabric 7.9.0
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

122 features found and 5 ignored
  0.00s loading features ...
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
  8.28s All features loaded/computed - for details use loadLog()


## Get Context Counts Around Window (bag of words)

For every lexeme found in a time phrase, count the other lexemes that occur within 5 words of it, across every occurrence of that lexeme in the Hebrew Bible. This allows us to construct an approximate semantic profile that can be compared between terms.

A "bag of words" model means that we do not consider the position of a context word relative to the target word, in contrast to a positional ("ngram") model, which records that position.
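
As a rough illustration of the difference between the two models, here is a minimal sketch on a plain list of tokens. The helper `toy_window` and the toy sentence are hypothetical and stand in for the corpus lookups used in this notebook; only the windowing logic is meant to carry over.

```python
from collections import Counter, defaultdict

def toy_window(tokens, i, window=5, model='bagofwords'):
    """Hypothetical helper: context items within `window` positions of tokens[i]."""
    context = []
    offsets = list(range(-window, 0)) + list(range(1, window + 1))
    for offset in offsets:
        j = i + offset
        if 0 <= j < len(tokens):  # skip positions outside the sentence
            if model == 'bagofwords':
                context.append(tokens[j])                # position discarded
            elif model == 'ngram':
                context.append(f'{offset}.{tokens[j]}')  # position kept
    return context

# toy sentence for illustration only, not corpus data
tokens = ['in', 'the', 'day', 'of', 'the', 'king']

bag = toy_window(tokens, 2, window=2)
positional = toy_window(tokens, 2, window=2, model='ngram')
print(bag)         # ['in', 'the', 'of', 'the']
print(positional)  # ['-2.in', '-1.the', '1.of', '2.the']

# accumulate context counts per target word over all its occurrences
counts = defaultdict(Counter)
for i, tok in enumerate(tokens):
    counts[tok].update(toy_window(tokens, i, window=2))
print(counts['the']['day'])  # 2
```

In the bag-of-words output the two occurrences of `the` collapse into the same context feature, whereas the positional variant keeps them apart as `-1.the` and `2.the`.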

In [5]:
words = Words(A)

In [6]:
words.findall(2)

[CX cont (2,)]

In [14]:
def get_window(word, model='bagofwords'):
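    '''
    Build a contextual window around a word; return its context words.

    With model='bagofwords' only the context lexeme is kept; with
    model='ngram' the lexeme is prefixed with its relative position.
    '''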
    window = 5
    context = 'sentence'
    confeat = 'lex'
    P = PositionsTF(word, context, A).get
    fore = list(range(-window, 0))
    back = list(range(1, window+1))
    conwords = []
    for pos in (fore + back):
        cword = P(pos, confeat)
        if cword:
            if model == 'bagofwords':
                conwords.append(f'{cword}')
            elif model == 'ngram':
                conwords.append(f'{pos}.{cword}')
    return conwords

wordcons = collections.defaultdict(collections.Counter)

timelexs = set()

for ph in F.otype.s('timephrase'):
    for w in L.d(ph,'word'): 
        cx = words.findall(w)[0]
        if cx.name == 'cont':
            timelexs.add(L.u(w,'lex')[0])

timewords = set(
    w for lex in timelexs
        for w in L.d(lex,'word')
)

print(f'{len(timewords)} timewords ready for analysis...')

for w in timewords:
    context = get_window(w)
    wordcons[F.lex.v(w)].update(context)
        
wordcons = pd.DataFrame.from_dict(wordcons, orient='index').fillna(0)
        
print(f'{wordcons.shape[1]} words analyzed...')
print(f'\t{wordcons.shape[0]} word contexts analyzed...')

66627 timewords ready for analysis...
6316 words analyzed...
	263 word contexts analyzed...


In [15]:
wordcons.head()

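| | `W` | `CKM[` | `H` | `<JR/` | `B` | `BQR=/` | `>L` | `<M/` | `PNH[` | `L` | ... | `JQRH/` | `NDDJM/` | `>CMNJM/` | `MRWY/` | `BKRH=/` | `FRK[` | `QJR_MW>B/` | `MWRCT_GT/` | `BZR[` | `RJR[` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `>JC/` | 1543.0 | 3.0 | 1191.0 | 56.0 | 474.0 | 8.0 | 202.0 | 60.0 | 7.0 | 621.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `KL/` | 4115.0 | 5.0 | 3620.0 | 155.0 | 1717.0 | 5.0 | 378.0 | 428.0 | 8.0 | 1803.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `R>CJT/` | 32.0 | 0.0 | 17.0 | 0.0 | 14.0 | 0.0 | 2.0 | 1.0 | 0.0 | 19.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `>LHJM/` | 1118.0 | 1.0 | 1080.0 | 20.0 | 597.0 | 2.0 | 236.0 | 66.0 | 4.0 | 1062.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `BN/` | 3709.0 | 0.0 | 1474.0 | 44.0 | 683.0 | 5.0 | 431.0 | 63.0 | 1.0 | 1696.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |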


In [19]:
wordcons.shape

(263, 6316)

In [16]:
wordcons.shape[0] * wordcons.shape[1]

1661108

In [20]:
wordcons['CNH/'].sort_values(ascending=False).head(10)

BN/       229.0
CNH/      218.0
CLC/      146.0
CB</      122.0
XMC/      121.0
M>H/      111.0
<FRJM/    101.0
>XD/       96.0
<FRH/      88.0
MLK/       86.0
Name: CNH/, dtype: float64

## Measure Target Word / Context Associations 

We will adjust the raw counts with $\Delta P$, a unidirectional measure of association (contingency) suggested by Nick Ellis, 2006, "Language Acquisition".

In [23]:
# contingency table
a,b,c,d,e = contingency_table(wordcons, 0, 1)

### Apply ΔP 

We need an efficient (i.e. simple) normalization method for such a large dataset. ΔP is such a measure, and it takes contingency information into account [(Gries 2008)](https://www.researchgate.net/publication/233650934_Dispersions_and_adjusted_frequencies_in_corpora_further_explorations).

In [24]:
deltap = (a / (a + b) - c / (c + d)).fillna(0)
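
To make the formula concrete, here is a minimal sketch of the same ΔP expression on a toy count matrix. The cell definitions for `a`, `b`, `c`, and `d` below are an assumption about what `contingency_table` returns (the usual 2x2 layout per target/context pair), and the counts are invented for illustration.

```python
import pandas as pd

# hypothetical co-occurrence counts: rows = target words, columns = context words
counts = pd.DataFrame(
    [[8.0, 2.0],
     [1.0, 9.0]],
    index=['target1', 'target2'],
    columns=['ctx1', 'ctx2'],
)

total = counts.values.sum()
row_sums = counts.sum(axis=1)  # all contexts seen with each target
col_sums = counts.sum(axis=0)  # all targets seen with each context

# assumed 2x2 cells, computed for every target/context pair at once
a = counts                          # target with this context
b = counts.rsub(row_sums, axis=0)   # target with any other context
c = counts.rsub(col_sums, axis=1)   # other targets with this context
d = total - a - b - c               # other targets with other contexts

# same expression as in the cell above:
# P(context | target) - P(context | other targets)
deltap = (a / (a + b) - c / (c + d)).fillna(0)
print(deltap.round(2))
```

For `target1` and `ctx1` this gives 8/10 - 1/10 = 0.7. Under this layout ΔP ranges from -1 to 1, so negative values mark contexts that are less likely with the target than elsewhere.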

## Calculate Cosine Distance

In [90]:
distances_raw = pairwise_distances(np.nan_to_num(deltap.values), metric='cosine')

In [91]:
dist = pd.DataFrame(distances_raw, columns=wordcons.index, index=wordcons.index)
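
For readers unfamiliar with the metric, the following is a hand-written sketch of the cosine distance formula (one minus cosine similarity), written with NumPy only rather than the scikit-learn call used above. The function name `cosine_distance_matrix` and the toy vectors are hypothetical, and the sketch assumes no row is all zeros.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Hypothetical helper: pairwise 1 - cos(u, v) for the rows of X.

    Assumes every row has a non-zero norm.
    """
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale rows to length 1
    return 1.0 - unit @ unit.T

# toy vectors with signed values, as ΔP scores can be negative
vectors = np.array([
    [0.7, -0.7],   # u
    [0.7,  0.7],   # orthogonal to u
    [-0.7, 0.7],   # opposite direction to u
])

toy_distances = cosine_distance_matrix(vectors)
print(toy_distances[0])  # distances from u to itself, the orthogonal, the opposite vector
```

Because the vectors can contain negative components, the distance ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite direction). That is why some distances reported in this notebook exceed 1.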

## Testing Efficacy

We want to use semantic vectors to disambiguate coordinate relations when there is more than one candidate to connect a target to.

### Hypothesis: Candidates for coordinate pairs can be distinguished by selecting the candidate with the shortest distance in semantic space from the target word.

In [92]:
def show_dist(target, compares):
    """Return candidates in order of distance."""
    return sorted(
        (dist.loc[target][comp], comp) 
            for comp in compares
    )

def show_phrase(phrase_node):
    """Show plain text of phrase without links"""
    A.plain(phrase_node, isLinked=False)
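
The hypothesis amounts to a nearest-candidate rule. A minimal sketch of that rule follows, assuming a distance table shaped like `dist`. The helper `pick_coordinate` is hypothetical, and the two values in the toy table are rounded from the `show_dist` output for `K>B/` reported in this notebook.

```python
import pandas as pd

# one-row stand-in for `dist`: row = target, columns = candidates
# (values rounded from the K>B/ comparison in this notebook)
toy_dist = pd.DataFrame(
    {'XLH[': {'K>B/': 0.708}, 'JWM/': {'K>B/': 1.148}}
)

def pick_coordinate(target, candidates, distances):
    """Hypothetical helper: return the candidate nearest to the target."""
    return min(candidates, key=lambda cand: distances.loc[target, cand])

best = pick_coordinate('K>B/', ['XLH[', 'JWM/'], toy_dist)
print(best)  # XLH[
```

A parser integration would probably also want the margin between the best and second-best candidate, since a narrow margin signals a weak preference.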

### כאב: with נחלה or יום?

In [93]:
show_phrase(777703)

In [94]:
show_dist('K>B/', ('XLH[', 'JWM/'))

[(0.7080663087842192, 'XLH['), (1.14774631861581, 'JWM/')]

The model prefers the pairing of כאב with חלה, a good match.

What about אנושׁ? Should it describe כאב, נחלה, or יום?

In [95]:
show_dist('>NWC=/', ('K>B/', 'XLH[', 'JWM/', ))

[(0.6080222611075488, 'XLH['),
 (0.6267557134554581, 'K>B/'),
 (1.2975244847974814, 'JWM/')]

The model is tricky here. It places כאב (0.627) much closer than יום (1.298), but ranks חלה (0.608) marginally ahead of it. With so small a margin, preferring חלה over כאב may be a spurious inference. But the pairing with חלה is interesting.

### אפלה: with לילה or אישׁון?

In [96]:
show_phrase(862564)

In [97]:
show_dist('>PLH/', ('LJLH/', '>JCWN/'))

[(0.4659176693391708, 'LJLH/'), (0.6617132376810086, '>JCWN/')]

Success. LJLH/ is the semantically closer candidate.

### מרוד: with עניה or ימי?

In [98]:
show_phrase(872677)

In [99]:
show_dist('MRWD/', ('<NJ=/', 'JWM/'))

[(0.7001411338132839, '<NJ=/'), (1.413937448378942, 'JWM/')]

Success.

### אם: with אב or מות?

In [101]:
show_phrase(874237)

In [102]:
show_dist('>M/', ('>B/', 'MWT/'))

[(0.5060572999575104, '>B/'), (1.2196161704675816, 'MWT/')]

Success.

### ערפל: with ענן or יום?

In [103]:
show_phrase(817713)

In [104]:
show_dist('<RPL/', ('JWM/', '<NN/'))

[(0.7347457677452038, '<NN/'), (1.1129065082500667, 'JWM/')]

Success. <NN/ is correctly selected as more semantically similar.

# Export Vector Resource

In [105]:
import pickle

In [106]:
dist_dict = dist.to_dict()

In [107]:
with open(semvector, 'wb') as outfile:
    pickle.dump(dist_dict, outfile)
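
For anyone consuming the exported resource, here is a small round-trip sketch showing the shape of the pickled object. The temporary file path is a stand-in for `semvector`, and the single distance value is rounded from the `>M/` comparison above; `DataFrame.to_dict()` with default arguments nests the data as `{column: {row: value}}`.

```python
import os
import pickle
import tempfile

import pandas as pd

# toy stand-in for `dist`, using one rounded value from this notebook
labels = ['>M/', '>B/']
toy = pd.DataFrame([[0.0, 0.506], [0.506, 0.0]], index=labels, columns=labels)

# hypothetical output path, standing in for `semvector`
path = os.path.join(tempfile.mkdtemp(), 'semvector_toy.pickle')

with open(path, 'wb') as outfile:
    pickle.dump(toy.to_dict(), outfile)

with open(path, 'rb') as infile:
    loaded = pickle.load(infile)

# nested lookup: first key is the column label, second key is the row label
print(loaded['>M/']['>B/'])  # 0.506
```

Since the distance matrix is symmetric, the order of the two keys does not affect the value returned.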