<img align="right" src="images/tf-small.png"/>

# Phrases in versions of the BHSA

In [versionMappings](https://github.com/ETCBC/bhsa/blob/master/programs/versionMappings.ipynb)
we have constructed edge features that map the nodes from one version of the data to the next.
In this notebook we are going to use those edges to study what happened to the feature `function`
of `phrases`.

# Overview

We explore:
* how the values of the `function` feature have changed;
* to what degree phrase boundaries have changed.

# Discussion
The feature `function` was called `phrase_function` in version `3`.

## Phrase boundaries
In order to see whether phrase boundaries have changed, we follow the `omap@` edges from
phrases in one version to their counterparts in the next version.

We make use of the dissimilarity values that are attached to such edges.
If there is no value, or the value is `0`, we have a match without a boundary change.
All other dissimilarities imply that boundaries have changed.
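
The rule above can be written down as a small predicate. This is a minimal sketch and not part of the notebook: the helper name `boundary_changed` is hypothetical, and it assumes that a dissimilarity is either `None` (the edge carries no value) or a non-negative integer.

```python
def boundary_changed(dis):
    """Tell whether an omap edge with dissimilarity `dis` signals a boundary change.

    Assumption: `dis` is None when the edge has no value,
    otherwise it is a non-negative integer.
    """
    return dis is not None and dis != 0


# Toy dissimilarity values, made up for illustration.
examples = [None, 0, 1, 5]
flags = [boundary_changed(d) for d in examples]
```

Only the last two toy values count as boundary changes; a missing value and `0` are treated alike.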

# Results
For the sake of presentation,
we start with the result cells. **They should be run after the other cells.**
The computation starts [here](#Start).

# Changes in `function` values

In [11]:
for (i, w) in enumerate(versions):
    if i == 0: continue
    v = versions[i-1]
    caption(1, 'Phrase function change from version {} to {}'.format(v, w))
    featureDiff(v, w, 'FUNCTION')


##############################################################################################
#                                                                                            #
#      6m 41s Phrase function change from version 3 to 4                                     #
#                                                                                            #
##############################################################################################



3\4,Adju,Cmpl,Conj,EPPr,ExsS,Exst,Frnt,IntS,Intj,Loca,ModS,Modi,NCoS,NCop,Nega,Objc,PrAd,PrcS,PreC,PreO,PreS,Pred,PtcO,Ques,Rela,Subj,Supp,Time,Unkn,Voct
Adju,6067,74,15,,,,6,,,10,,31,,,,43,19,,65,,,,,,1,43,,15,,2
Cmpl,90,22418,12,,,,5,,,14,,6,,,1,79,26,,21,,,3,,1,1,71,3,6,,2
Conj,87,27,33540,,,,6,,,8,,154,,,,36,2,,5,2,3,18,,,39,22,,7,,1
ExsS,,,,,7,,,,,,,,,,,,,,,,,,,,,,,,,
Exst,,,,,2,90,,,,,1,,3,9,,,,,,,,,,,,,,,,
Frnt,8,2,1,,,,785,,,,,,,,,8,,,12,,,,,,,22,,,,5
IntS,,,,,,,,161,,,,,,,,,,,1,,,,,,,,,,,
Intj,,1,2,,,,,17,1199,,,16,,,,5,,,9,,,3,,5,,5,,1,,1
IrpC,1,18,,,,,,,,,,,,,,,,,3,,,,,,,,,,,



##############################################################################################
#                                                                                            #
#      6m 42s Phrase function change from version 4 to 4b                                    #
#                                                                                            #
##############################################################################################



4\4b,Adju,Cmpl,Conj,EPPr,ExsS,Exst,Frnt,IntS,Intj,Loca,ModS,Modi,NCoS,NCop,Nega,Objc,PrAd,PrcS,PreC,PreO,PreS,Pred,PtcO,Ques,Rela,Subj,Supp,Time,Voct
Adju,8061,94,13,,,,7,,,10,,206,,,1,155,82,,65,5,1,,,8,1,186,3,17,
Cmpl,77,27606,9,,,,2,,,10,,8,,,,65,7,,105,,,3,1,1,5,86,,2,3
Conj,44,39,45936,,,,17,,,10,,6,,,,110,1,,19,1,,3,,,74,42,,7,1
EPPr,,,,4,,,,,,,,,,,,,,,,,,,,,,,,,
ExsS,,,,,14,,,,,,,,,,,,,,,,,,,,,,,,
Exst,,,,,,143,,,,,,,,1,,,,,,,,,,,,,,,
Frnt,1,5,,,,,1007,1,,,,,,,,2,,,5,,,,,,,5,,,3
IntS,,,,,,,,250,,,,,,,,,,,,,,,,,,,,,
Intj,,,,,,,,,1624,,,,,,,,,,,,,,,,,1,,,3



##############################################################################################
#                                                                                            #
#      6m 44s Phrase function change from version 4b to 2016                                 #
#                                                                                            #
##############################################################################################



4b\2016,Adju,Cmpl,Conj,EPPr,ExsS,Exst,Frnt,IntS,Intj,Loca,ModS,Modi,NCoS,NCop,Nega,Objc,PrAd,PrcS,PreC,PreO,PreS,Pred,PtcO,Ques,Rela,Subj,Supp,Time,Voct
Adju,9477,31,1,,,,5,,,1,,1,,,1,11,1,,6,,,,,,1,8,2,1,
Cmpl,39,29921,1,,,,6,,,8,,1,,,2,41,,,24,,,,,,,11,,7,
Conj,1,,46124,,,,,,,,,,,,,,,,1,,,,,,2,1,,,
EPPr,,,,9,,,,,,,,,,,,,,,,,,,,,,,,,
ExsS,,,,,14,,,,,,,,,,,,,,,,,,,,,,,,
Exst,,,,,,143,,,,,,,,,,,,,,,,,,,,,,,
Frnt,,,,,,,1087,,,,,,,,,,,,,,,,,,,25,,,
IntS,,,,,,,,251,,,,,,,,,,,,,,,,,,,,,
Intj,,,,,,,,,1621,,,,,,,,,,,,,,,,,,,,



##############################################################################################
#                                                                                            #
#      6m 45s Phrase function change from version 2016 to 2017                               #
#                                                                                            #
##############################################################################################



2016\2017,Adju,Cmpl,Conj,EPPr,ExsS,Exst,Frnt,IntS,Intj,Loca,ModS,Modi,NCoS,NCop,Nega,Objc,PrAd,PrcS,PreC,PreO,PreS,Pred,PtcO,Ques,Rela,Subj,Supp,Time,Voct
Adju,9508,12,2,,,,5,,,2,,,,,,5,,,4,,,,,,,2,,6,
Cmpl,16,30002,4,,,,,,,,,,,,,13,,,1,,,,,,,1,,,
Conj,1,,46135,,,,,,,,,,,,,3,,,,,,,,,3,1,,,
EPPr,,,,21,,,,,,,,,,,,,,,,,,,,,,,,,
ExsS,,,,,14,,,,,,,,,,,,,,,,,,,,,,,,
Exst,,,,,,143,,,,,,,,,,,,,,,,,,,,,,,
Frnt,,1,,,,,1119,,,,,,,,,,,,1,,,,1,,,9,,,
IntS,,,,,,,,251,,,,,,,,,,,,,,,,,,,,,
Intj,,,,,,,,,1621,,,,,,,,,,,,,,,,,,,,


# Boundary statistics

In [24]:
for (i, w) in enumerate(versions):
    if i == 0: continue
    v = versions[i-1]
    caption(1, 'Phrase boundary change from version {} to {}'.format(v, w))
    showStats(v, w)


##############################################################################################
#                                                                                            #
#     21m 22s Phrase boundary change from version 3 to 4                                     #
#                                                                                            #
##############################################################################################



dissimilarity,number of phrases
0,250346
1,2909
2,1385
3,951
4,556
5,342
6,218
7,141
8,105



##############################################################################################
#                                                                                            #
#     21m 22s Phrase boundary change from version 4 to 4b                                    #
#                                                                                            #
##############################################################################################



dissimilarity,number of phrases
0,250751
1,832
2,843
3,713
4,436
5,336
6,210
7,170
8,150



##############################################################################################
#                                                                                            #
#     21m 22s Phrase boundary change from version 4b to 2016                                 #
#                                                                                            #
##############################################################################################



dissimilarity,number of phrases
0,252881
1,110
2,80
3,61
4,27
5,17
6,9
7,12
8,15



##############################################################################################
#                                                                                            #
#     21m 22s Phrase boundary change from version 2016 to 2017                               #
#                                                                                            #
##############################################################################################



dissimilarity,number of phrases
0,253073
1,29
2,22
3,18
4,13
5,5
6,5
7,6
8,1


# Start
Start the program here.

In [1]:
import os
import collections
from functools import reduce
from utils import caption
from tf.fabric import Fabric

from IPython.display import HTML, display

We specify our versions and the subtle differences between them as far as they are relevant.

In [2]:
REPO = os.path.expanduser('~/github/etcbc/bhsa')
baseDir = '{}/tf'.format(REPO)
tempDir = '{}/_temp'.format(REPO)

versions = '''
    3 
    4 
    4b 
    2016
    2017
'''.strip().split()

versionInfoSpec = {
    '': dict(
            OCC='g_word',
            LEX='lex',
            FUNCTION='function',
        ),
    '3': dict(
            OCC='text_plain',
            LEX='lexeme',
            FUNCTION='phrase_function',
        ),
}

versionInfo = {}

defaults = versionInfoSpec[''].items()

for (i, v) in enumerate(versions):
    versionInfo.setdefault(v, {})['OMAP'] = '' if i == 0 else 'omap@{}-{}'.format(versions[i-1], v)
    versionInfo[v].update(versionInfoSpec.get('', {}))
    versionInfo[v].update(versionInfoSpec.get(v, {}))
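
To make the effect of the defaults and the per-version overrides concrete, here is a self-contained re-run of the same merging logic for just two versions. It is a sketch for illustration: the names `spec` and `info` are stand-ins for `versionInfoSpec` and `versionInfo`, so that running it does not disturb the real variables.

```python
# Two versions are enough to show both cases: with and without overrides.
versions = ['3', '4']

spec = {
    '': dict(OCC='g_word', LEX='lex', FUNCTION='function'),  # defaults
    '3': dict(OCC='text_plain', LEX='lexeme', FUNCTION='phrase_function'),
}

info = {}
for (i, v) in enumerate(versions):
    # The first version has no predecessor, hence no mapping edge to load.
    omap = '' if i == 0 else 'omap@{}-{}'.format(versions[i - 1], v)
    info[v] = dict(OMAP=omap)
    info[v].update(spec.get('', {}))  # defaults first
    info[v].update(spec.get(v, {}))   # version-specific overrides last

print(info['3']['FUNCTION'])  # phrase_function
print(info['4']['FUNCTION'])  # function
print(info['4']['OMAP'])      # omap@3-4
```

Because the overrides are applied after the defaults, only version `3` ends up with the older feature names.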

Load all versions in one go, with the version mapping feature if present.

In [3]:
TF = {}
api = {}
for (i, v) in enumerate(versions):
    for (param, value) in versionInfo[v].items():
        globals()[param] = value
    caption(4, 'Version -> {} <- loading ...'.format(v))
    TF[v] = Fabric(locations='{}/{}'.format(baseDir, v), modules=[''])
    api[v] = TF[v].load(' '.join((OCC, LEX, FUNCTION, OMAP)))

..............................................................................................
.       0.00s Version -> 3 <- loading ...                                                    .
..............................................................................................
This is Text-Fabric 3.0.8
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

118 features found and 0 ignored
  0.00s loading features ...
   |     0.12s B lexeme               from /Users/dirk/github/etcbc/bhsa/tf/3
   |     0.17s B text_plain           from /Users/dirk/github/etcbc/bhsa/tf/3
   |     0.08s B phrase_function      from /Users/dirk/github/etcbc/bhsa/tf/3
   |     0.00s Feature overview: 115 for nodes; 2 for edges; 1 configs; 7 computed
  4.79s All features loaded/computed - for details use loadLog()
...........................

# Utility function: tables in your cells

In [None]:
def tableText(table):
    return display(HTML(
        '<table><tr>{}</tr></table>'.format(
            '</tr><tr>'.join(
                '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in table)
            )
     ))
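
To inspect the markup outside a notebook, the same string building can be done without `display`. This is a sketch: `tableHtml` is a hypothetical helper that returns the HTML string instead of rendering it, and the table content is made up.

```python
def tableHtml(table):
    """Build the HTML that tableText would display, and return it as a string."""
    rows = (
        '<td>{}</td>'.format('</td><td>'.join(str(cell) for cell in row))
        for row in table
    )
    return '<table><tr>{}</tr></table>'.format('</tr><tr>'.join(rows))


# A two-row toy table: a header row and one data row.
html = tableHtml([['dissimilarity', 'number of phrases'], [0, 3]])
```

Every cell is converted with `str`, so numbers and strings can be mixed freely in a row.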

# Get counterparts

Here is a function that gets the counterparts of phrases between versions, and classifies them according to dissimilarity.

`phraseMapping` is keyed by a (source version, target version) pair,
then by dissimilarity, then by node in the source version;
the value is the set of counterpart nodes in the target version.

Source nodes that lack a counterpart end up in a bucket with dissimilarity `-1`.
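
The nesting is easiest to see on a toy instance. The sketch below uses made-up node numbers; note that, following the code in this notebook, the `-1` bucket is a plain set of source nodes, while every other bucket is a dictionary from source node to a set of target nodes.

```python
# Toy illustration of the nesting; all node numbers are made up.
phraseMapping = {
    ('3', '4'): {
        0: {1001: {2001}, 1002: {2002}},  # counterparts with identical boundaries
        2: {1003: {2003, 2004}},          # counterparts with different boundaries
        -1: {1004},                       # no counterpart: a set of source nodes
    },
}

phrases = phraseMapping[('3', '4')]
unchanged = len(phrases.get(0, {}))
unmapped = len(phrases.get(-1, set()))
# Count source phrases in all buckets with a positive dissimilarity.
changed = sum(len(m) for (dis, m) in phrases.items() if dis > 0)
```

Taking `len` of a bucket works for both shapes, which is what the statistics table relies on.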

In [4]:
phraseMapping = {}

In [5]:
def getPhrases(v, w):
    V = api[v]
    W = api[w]
    mapVW = 'omap@{}-{}'.format(v, w)
    vKey = (v, w)
    
    phraseMapping[vKey] = {}
    phrases = phraseMapping[vKey]
    
    for n in V.F.otype.s('phrase'):
        ms = W.Es(mapVW).f(n)
        if ms is None:
            # no counterpart in the target version
            phrases.setdefault(-1, set()).add(n)
        else:
            for (m, dis) in ms:
                # an edge without a value counts as dissimilarity 0
                phrases.setdefault(0 if dis is None else dis, {}).setdefault(n, set()).add(m)

# Table of boundary changes

In [22]:
def showStats(v, w):
    vKey = (v, w)
    phrases = phraseMapping[vKey]
    stats = collections.Counter()
    for dis in phrases:
        stats[dis] = len(phrases[dis])
    table = []
    table.append(['dissimilarity', 'number of phrases'])
    for dis in range(0, max(stats) + 1):
        table.append([dis, stats.get(dis, '')])
    tableText(table)

# Table of old and new values
We visualise the changes in the values of the `function` feature
by generating a matrix with the old values in the row headers
and the new values in the column headers.
Each cell holds the number of phrases with that old value (row) whose counterpart has that new value (column).
Cells on the diagonal count phrases whose value stayed the same.
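
The counting scheme can be sketched on a handful of made-up value pairs. This is an illustration under assumptions, not the notebook's own function: `pairs` stands in for the (old value, new value) combinations that would be read from two versions.

```python
import collections

# Made-up (old value, new value) pairs, one per mapped phrase.
pairs = [
    ('Adju', 'Adju'),
    ('Adju', 'Cmpl'),
    ('Cmpl', 'Cmpl'),
    ('Cmpl', 'Cmpl'),
]

# For every old value, count how often each new value occurs.
combis = {}
for (old, new) in pairs:
    combis.setdefault(old, collections.Counter())[new] += 1

old_values = sorted(combis)
new_values = sorted(set().union(*(combis[o] for o in old_values)))

# First row: column headers; following rows: one per old value.
table = [['old\\new'] + new_values]
for o in old_values:
    table.append([o] + [combis[o].get(n, '') for n in new_values])
```

Combinations that never occur are left as empty cells, which keeps the large matrices readable.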

In [7]:
def featureDiff(v, w, feat):
    V = api[v]
    W = api[w]
    vKey = (v, w)
    vFeat = versionInfo[v][feat]
    wFeat = versionInfo[w][feat]
    phrases = phraseMapping[vKey]

    combis = {}
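    for (dis, mapping) in phrases.items():
        if dis == -1:
            # phrases without a counterpart are stored as a plain set: nothing to compare
            continue
        for (n, ms) in mapping.items():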
            vVal = V.Fs(vFeat).v(n)
            for m in ms:
                wVal = W.Fs(wFeat).v(m)
                combis.setdefault(vVal, collections.Counter())[wVal] += 1
    vValues = sorted(combis.keys())
    wValues = sorted(reduce(set.union, [set(combis[v]) for v in vValues], set()))
    #print('{} {}'.format('V', ' '.join(vValues)))
    #print('{} {}'.format('W', ' '.join(wValues)))
    table = []
    table.append(['{}\\{}'.format(v, w)] + wValues)
    for v in vValues:
        table.append([v] + [str(combis[v].get(w, '')) for w in wValues])
    tableText(table)   

# Collect
We collect the phrase mappings between all pairs of consecutive versions in `phraseMapping`.

In [9]:
caption(4, 'Collecting data')
for (i, w) in enumerate(versions):
    if i == 0: continue
    v = versions[i-1]
    caption(0, '\t{:<4} => {:<4}'.format(v, w))
    getPhrases(v, w)
caption(0, 'Done')

..............................................................................................
.      5m 49s Collecting data                                                                .
..............................................................................................
|      5m 49s 	3    => 4   
|      5m 51s 	4    => 4b  
|      5m 52s 	4b   => 2016
|      5m 54s 	2016 => 2017
|      5m 55s Done
