<img align="right" src="images/tf-small.png"/>

# Phrases in versions of the BHSA

In [versionMappings](https://github.com/ETCBC/bhsa/blob/master/programs/versionMappings.ipynb)
we have constructed edge features that map the nodes from one version of the data to the next.
In this notebook we are going to use those edges to study what happened to the feature `function`
of `phrases`.

# Overview
First we explore to what degree the boundaries of phrases have changed.

Secondly, we explore how the values of the `function` feature have changed.

# Discussion
The feature `function` was called `phrase_function` in version `3`.

## Phrase boundaries
To see whether phrase boundaries have changed, we follow the `omap@` edges from
phrases in one version to their counterparts in the next version.

We make use of the dissimilarity values that are attached to such edges.
If there is no value, or the value is `0`, we have a match without a boundary change.
All other dissimilarities imply that boundaries have changed.

We should also check whether some phrases lack a counterpart altogether.
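The classification rule for dissimilarities can be sketched as a small helper. The name `boundaryStatus` and the toy edge list are hypothetical; only the rule itself (no value or `0` means an unchanged boundary) comes from the text above, and the sketch assumes dissimilarities are integers or `None`.

```python
def boundaryStatus(dissimilarity):
    """Classify a mapping edge by its dissimilarity value."""
    # No value or 0: the counterpart has the same boundaries.
    if dissimilarity is None or dissimilarity == 0:
        return 'unchanged'
    # Any other value: the boundaries differ between the versions.
    return 'changed'

# Hypothetical (target node, dissimilarity) pairs; the node numbers are made up.
toyEdges = [(1001, None), (1002, 0), (1003, 2)]
statuses = [boundaryStatus(dis) for (node, dis) in toyEdges]
print(statuses)
```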

In [1]:
import os
import collections
from functools import reduce
from utils import caption
from tf.fabric import Fabric

We specify our versions and the subtle differences between them as far as they are relevant.

In [2]:
REPO = os.path.expanduser('~/github/etcbc/bhsa')
baseDir = '{}/tf'.format(REPO)
tempDir = '{}/_temp'.format(REPO)

versions = '''
    3 
    4 
    4b 
    2016
    2017
'''.strip().split()

versionInfoSpec = {
    '': dict(
            OCC='g_word',
            LEX='lex',
            FUNCTION='function',
        ),
    '3': dict(
            OCC='text_plain',
            LEX='lexeme',
            FUNCTION='phrase_function',
        ),
}

versionInfo = {}

defaults = versionInfoSpec[''].items()

for (i, v) in enumerate(versions):
    versionInfo.setdefault(v, {})['OMAP'] = '' if i == 0 else 'omap@{}-{}'.format(versions[i-1], v)
    versionInfo[v].update(versionInfoSpec.get('', {}))
    versionInfo[v].update(versionInfoSpec.get(v, {}))
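As a spot check of the settings this produces, here is a self-contained sketch that rebuilds the same dictionary for a shortened version list and prints two entries. It assumes nothing beyond the names used in the cell above; the shortened list is only for illustration.

```python
# Shortened version list, for illustration only.
versions = ['3', '4', '4b']

versionInfoSpec = {
    '': dict(OCC='g_word', LEX='lex', FUNCTION='function'),
    '3': dict(OCC='text_plain', LEX='lexeme', FUNCTION='phrase_function'),
}

versionInfo = {}
for (i, v) in enumerate(versions):
    # Start from the defaults, then apply version-specific overrides.
    info = dict(versionInfoSpec[''])
    info.update(versionInfoSpec.get(v, {}))
    # The first version has no predecessor, hence no mapping feature.
    info['OMAP'] = '' if i == 0 else 'omap@{}-{}'.format(versions[i - 1], v)
    versionInfo[v] = info

print(versionInfo['3'])
print(versionInfo['4b'])
```

Version `3` gets its own feature names and an empty `OMAP`, while later versions fall back to the defaults and get a mapping feature named after their predecessor.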

Load all versions in one go, with the version mapping feature if present.

In [6]:
TF = {}
api = {}
for (i, v) in enumerate(versions):
    for (param, value) in versionInfo[v].items():
        globals()[param] = value
    caption(4, 'Version -> {} <- loading ...'.format(v))
    TF[v] = Fabric(locations='{}/{}'.format(baseDir, v), modules=[''])
    api[v] = TF[v].load(' '.join((OCC, LEX, FUNCTION, OMAP)))

..............................................................................................
.       0.00s Version -> 3 <- loading ...                                                    .
..............................................................................................
This is Text-Fabric 3.0.7
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

118 features found and 0 ignored
  0.00s loading features ...
   |     0.12s B lexeme               from /Users/dirk/github/etcbc/bhsa/tf/3
   |     0.17s B text_plain           from /Users/dirk/github/etcbc/bhsa/tf/3
   |     0.07s B phrase_function      from /Users/dirk/github/etcbc/bhsa/tf/3
   |     0.00s Feature overview: 115 for nodes; 2 for edges; 1 configs; 7 computed
  4.64s All features loaded/computed - for details use loadLog()
...........................

We want to switch easily between the APIs for the versions.

In [7]:
def activate(v):
    # fall back to the default settings; versionInfo itself has no '' key
    for (param, value) in versionInfo.get(v, versionInfoSpec['']).items():
        globals()[param] = value
    api[v].makeAvailableIn(globals())
    caption(4, 'Active version is now -> {} <-'.format(v))

# Get counterparts

Here is a function that gets the counterparts of phrases between versions, and classifies them according to dissimilarity.

`phraseMapping` is keyed by a (source version, target version) pair,
then by dissimilarity, then by node in the source version;
the value is the set of counterpart nodes in the target version.

Source nodes that lack a counterpart end up in a bucket with dissimilarity -1, which is a plain set of nodes.
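To make the shape of this structure concrete, here is a toy instance together with a helper that counts source phrases per dissimilarity bucket. The names `toyMapping` and `summarize` are hypothetical and the node numbers are made up; this is a sketch of the layout described above, not output from the real data.

```python
# Toy stand-in for phraseMapping; node numbers are invented.
toyMapping = {
    ('3', '4'): {
        0: {605144: {605200}},           # exact match
        2: {605145: {605201, 605202}},   # boundaries changed
        -1: {605146},                    # no counterpart: a plain set of nodes
    },
}

def summarize(mapping, vKey):
    """Count source phrases per dissimilarity bucket."""
    buckets = mapping[vKey]
    # len() works for the dict buckets and for the -1 set alike.
    return {dis: len(buckets[dis]) for dis in sorted(buckets)}

summary = summarize(toyMapping, ('3', '4'))
print(summary)
```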

In [10]:
phraseMapping = {}

In [20]:
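def getPhrases(v, w):
    V = api[v]
    W = api[w]
    mapVW = 'omap@{}-{}'.format(v, w)
    vKey = (v, w)

    phraseMapping[vKey] = {}
    phrases = phraseMapping[vKey]

    # i counts the phrases visited; a phrase without counterpart is counted twice
    i = 0
    # n and m are the loop nodes, so the version labels v and w are not overwritten
    for n in V.F.otype.s('phrase'):
        i += 1
        ws = W.Es(mapVW).f(n)
        # treat both None and an empty result as "no counterpart"
        if not ws:
            phrases.setdefault(-1, set()).add(n)
            i += 1
        else:
            for (m, dis) in ws:
                phrases.setdefault(dis, {}).setdefault(n, set()).add(m)
    print(i)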

In [21]:
getPhrases('3', '4')

257109


In [17]:
sorted(phraseMapping[('3', '4')].keys())

[]