<img align="right" src="images/tf.png" width="200"/>
<img align="right" src="images/huc.png" width="200"/>
<img align="right" src="images/logo.png" width="200"/>

---

To get started: consult [start](start.ipynb)

---

# Porting annotations

In the [entities](entities.ipynb) notebook we saw how we could use third-party features
with our corpus. However, these features have been constructed against an older version of the corpus.

The corpus has moved on, and we want to port those annotations to the newest version.
Text-Fabric has machinery to help with that. It turns out that we have
to make a mapping between the slots of both versions, and then Text-Fabric can do the rest.

With that mapping in hand, we can port *all* features, past, present and future, 
automatically from the older version to the newer, and vice versa.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use

## Make a slot mapping

In this notebook we map the *slot* nodes from version 0.8.1 (source version) to 1.0 (target version).

Basically this means that we map all slots from the source version to corresponding slots in the target version.

Some slots have an empty text (most of them contain some punctuation).

We do not want to be fussy about those slots.
We map them onto corresponding empty slots if possible, otherwise we map them onto the nearest
non-empty slot.

After establishing the slot mapping, we extend the mapping to all nodes in a generic way.
The code for this is already in the TF library.

In [3]:
from tf.dataset import Versions

va = "0.8.1"
vb = "1.0"

We load the data for both versions.

This time, we work in our GitHub clone, because we want to make the resulting
map available to everyone, after pushing the clone to GitHub.

In [4]:
A = {}

In [5]:
for v in (va, vb):
    A[v] = use(
        "CLARIAH/wp6-missieven:clone",
        checkout="clone",
        silent="deep",
        version=v
    )

We walk through the slots of both versions in parallel, starting at the first slot of each.
At each step we check whether the source and target slots have the same value for the `trans` feature.
If so, we map the source slot to the target slot and advance in both versions.
If not, but the text of one slot is a prefix or suffix of the text of the other
(this covers the case where one of them is empty), we advance in one or both versions and try again.
But if neither is the case, we have a real problem: a mismatch.

In that case we stop, and you have to inspect what is happening.
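
To see the walk in isolation, here is a minimal sketch on two toy word lists instead of the corpus.
It only handles the equal case and the empty case, not the prefix and suffix cases.
The name `align` and the toy data are made up for this illustration; the shape of the result,
`{sourceSlot: {targetSlot: None}}`, is the same as that of the real slot map.

```python
def align(source, target):
    """Align two lists of slot texts; return (slot_map, ok).

    Slots are numbered from 1, as in Text-Fabric.
    """
    slot_map = {}
    a, b = 1, 1
    while a <= len(source) and b <= len(target):
        text_a = source[a - 1] or ""
        text_b = target[b - 1] or ""
        if text_a == text_b:
            # same text: map and advance in both versions
            slot_map.setdefault(a, {})[b] = None
            a += 1
            b += 1
        elif text_b == "":
            # extra empty slot in the target: attach it to the current source slot
            slot_map.setdefault(a, {})[b] = None
            b += 1
        elif text_a == "":
            # extra empty slot in the source: attach it to the current target slot
            slot_map.setdefault(a, {})[b] = None
            a += 1
        else:
            # both non-empty and unequal: a real mismatch, stop here
            return slot_map, False
    return slot_map, True


toy_source = ["Generaal", "", "aan", "boord"]
toy_target = ["Generaal", "aan", "", "boord"]
toy_map, ok = align(toy_source, toy_target)
print(ok, toy_map)
```

Note that a source slot can end up with more than one target slot, which is why the values of the map are dictionaries and not single slots.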

In [8]:
def makeSlotMap():
    Fa = A[va].api.F
    Fb = A[vb].api.F
    transA = Fa.trans.v
    transB = Fb.trans.v
    maxSlotA = Fa.otype.maxSlot
    maxSlotB = Fb.otype.maxSlot

    print(
        f"""\
    Computing slotMap between:
    {va}: {maxSlotA:>8} slots,
    {vb}: {maxSlotB:>8} slots.\
"""
    )

    slotMap = {}

    good = True
    wA = 1
    wB = 1

    while wB <= maxSlotB and wA <= maxSlotA:
        textA = transA(wA) or ""
        textB = transB(wB) or ""

        if textA == textB:
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1

        elif textA.startswith(textB):
            slotMap.setdefault(wA, {})[wB] = None
            wB += 1
        elif textA.endswith(textB):
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1

        elif textB.startswith(textA):
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
        elif textB.endswith(textA):
            slotMap.setdefault(wA, {})[wB] = None
            wA += 1
            wB += 1

        else:
            print("Mismatch:")
            print(f"A: {wA:>8} = `{textA}`")
            print(f"B: {wB:>8} = `{textB}`")
            good = False
            break

    maxSlotMap = max(slotMap)
    if maxSlotMap > maxSlotA:
        print(f"maxSlot in A version {va} exceeded")
        print(f"Found {maxSlotMap}, but it should be <= {maxSlotA}")
        good = False

    if good:
        print(
            f"""\
slotMap succesfully created: {len(slotMap)} slots mapped.
"""
        )
    return slotMap

In [9]:
slotMap = makeSlotMap()

    Computing slotMap between:
    0.8.1:  5316429 slots,
    1.0:  5977367 slots.
slotMap succesfully created: 5316429 slots mapped.



Note that as of version 1.0 volume 14 has been included.

So we expect a discrepancy there.

And of course, we will not have entity feature values for volume 14.
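
One way to make that discrepancy visible is to list the target slots that no source slot maps to.
The helper `unmapped_targets` below is a hypothetical name, not part of Text-Fabric,
and it is shown on a toy slot map so that it can be run on its own.

```python
def unmapped_targets(slot_map, max_target_slot):
    """Return the target slots in 1..max_target_slot that are not in the map."""
    mapped = {b for targets in slot_map.values() for b in targets}
    return [b for b in range(1, max_target_slot + 1) if b not in mapped]


# toy slot map: source slot 2 maps to two target slots, target slots 4 and 5 are new
toy_slot_map = {1: {1: None}, 2: {2: None, 3: None}}
print(unmapped_targets(toy_slot_map, 5))
```

On the real data the call would be `unmapped_targets(slotMap, A[vb].api.F.otype.maxSlot)`,
and we would expect most of the unmapped slots to lie in volume 14.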

When we encounter problems, we can do a bit of checking to see what is going on.

The next function shows the line around a slot node, and can do so in both versions.

In [10]:
def show(v, n):
    F = A[v].api.F
    L = A[v].api.L
    T = A[v].api.T
    
    lines = L.u(n, otype="line")
    if not lines:
        lines = L.u(n + 1, otype="line")
    if not lines:
        lines = L.u(n - 1, otype="line")
    if not lines:
        print("no such line")
        return
    line = lines[0]
    print(T.sectionFromNode(line))
    words = L.d(line, otype="word")
    print(" ".join(f"[{w}={F.trans.v(w)}]" for w in words))
    print(T.text(line))

In [11]:
show(va, 49)
show(vb, 49)

(1, 3, 4)
[46=Generaal] [47=aan] [48=boord] [49=vertoefde] [50=verandert] [51=daaraan] [52=niets] [53=Ile] [54=de] [55=Mayo] [56=is] [57=een] [58=der] [59=Kaap] [60=Verdische]
Generaal aan boord vertoefde, verandert daaraan niets. Ile de Mayo is een der Kaap-Verdische 
(1, 3, 4)
[45=Generaal] [46=aan] [47=boord] [48=vertoefde] [49=verandert] [50=daaraan] [51=niets] [52=Ile] [53=de] [54=Mayo] [55=is] [56=een] [57=der] [58=Kaap] [59=Verdische]
Generaal aan boord vertoefde, verandert daaraan niets. Ile de Mayo is een der Kaap-Verdische 


## Make the complete node map

We now extend the slotMap to a full node map.

See [dataset.Versions](https://annotation.github.io/text-fabric/tf/dataset/nodemaps.html#tf.dataset.nodemaps.Versions) in the Text-Fabric documentation.
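
To get a feeling for what extending a slot map to other nodes involves, here is a rough sketch on toy data.
It is not Text-Fabric's implementation, which is more refined: it merely sends the slots of a source node
through the slot map and picks the target node with the largest overlap.
The function `map_node` and all node numbers are made up for this illustration.

```python
def map_node(source_slots, slot_map, target_nodes):
    """Find the target node that best covers the image of a source node.

    source_slots: set of slots of the source node
    slot_map: {source_slot: {target_slot: None}}
    target_nodes: {target_node: set of its slots}, all of the same node type
    Returns (best_target_node, is_perfect).
    """
    image = {b for a in source_slots for b in slot_map.get(a, {})}
    best, best_overlap = None, 0
    for node, slots in target_nodes.items():
        overlap = len(image & slots)
        if overlap > best_overlap:
            best, best_overlap = node, overlap
    # perfect means: the target node has exactly the mapped slots
    perfect = best is not None and target_nodes[best] == image
    return best, perfect


toy_slot_map = {1: {1: None}, 2: {2: None}, 3: {3: None}}
toy_targets = {101: {1, 2, 3}, 102: {4, 5}}

print(map_node({1, 2, 3}, toy_slot_map, toy_targets))
print(map_node({2, 3}, toy_slot_map, toy_targets))
```

The first node has an exact counterpart; the second one only has a partial counterpart.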

In [13]:
V = Versions({v: A[v].api for v in (va, vb)}, va, vb, silent="auto", slotMap=slotMap)

In [14]:
V.makeVersionMapping()

 5m 45s 
    **********************************************************************************************
    *                                                                                            *
    * Mapping volume nodes 0.8.1 ==> 1.0                                                         *
    *                                                                                            *
    **********************************************************************************************
    
 5m 56s ..............................................................................................
    . Statistics for 0.8.1 ==> 1.0 (volume)                                                      .
    ..............................................................................................
 5m 56s | 	TOTAL                          : 100.00%      13x
 5m 56s | 	unique, perfect                :  92.31%      12x
 5m 56s | 	multiple, non-perfect          :   7.69%       1x
 5m 56s

## Migrate the features

The node map is not perfect, but we did not expect it to be.

We migrate the entity features nevertheless.

Remember that they are not in the corpus, but in a third-party module of features.
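
Conceptually, migrating a node feature means carrying each value along the node map.
The sketch below does that with plain dictionaries.
The function `migrate_feature`, the toy data, and the rule that the first value wins on a conflict
are assumptions for the sake of illustration; Text-Fabric may resolve such cases differently.

```python
def migrate_feature(values, node_map):
    """Carry feature values from source nodes to target nodes.

    values: {source_node: value}
    node_map: {source_node: [target_nodes]}
    """
    migrated = {}
    for node, value in values.items():
        for target in node_map.get(node, ()):
            # assumption: if two source nodes land on one target, the first wins
            migrated.setdefault(target, value)
    return migrated


toy_kind = {10: "LOC", 11: "PER", 12: "SHP"}
toy_node_map = {10: [9], 11: [10, 11]}  # node 12 has no counterpart

print(migrate_feature(toy_kind, toy_node_map))
```

Values on source nodes without a counterpart are lost, and target nodes that nothing maps to get no value.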

In [31]:
THIRD_PARTY = "dirkroorda/voc-missives/export/tf"

A[va] = use(
    "CLARIAH/wp6-missieven:clone",
    checkout="clone",
    mod=f"{THIRD_PARTY}:clone",
    silent="deep",
    version=va,
)
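# the APIs of both versions; the source version now has the entity features loaded
api = {v: A[v].api for v in (va, vb)}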

In [32]:
V = Versions(api, va, vb, silent="auto")
V.migrateFeatures(("entityId", "entityKind"), location=f"~/github/{THIRD_PARTY}")

 2m 18s start migrating
   |       46s T omap@0.8.1-1.0       from ~/github/CLARIAH/wp6-missieven/tf/1.0
    47s All additional features loaded - for details use TF.isLoaded()
    47s Mapping entityId (node)
    47s Mapping entityKind (node)
  0.00s Exporting 2 node and 0 edge and 0 config features to ~/github/dirkroorda/voc-missives/export/tf/1.0:
   |     0.03s T entityId             to ~/github/dirkroorda/voc-missives/export/tf/1.0
   |     0.02s T entityKind           to ~/github/dirkroorda/voc-missives/export/tf/1.0
  0.05s Exported 2 node features and 0 edge features and 0 config features to ~/github/dirkroorda/voc-missives/export/tf/1.0
  0.05s Done


Now we can do a pull request to the third party to include a new version of the feature.

If the third party is no longer active, we can add the features to the corpus by omitting the `location=`
parameter in the call to `migrateFeatures`.

Or even better, we can add them to a module in the same repo as the corpus,
such as `entities/tf`.

In [33]:
org = A[va].context.org
repo = A[va].context.repo
V.migrateFeatures(("entityId", "entityKind"), location=f"~/github/{org}/{repo}/entities/tf")

  8.99s start migrating
  1.31s All additional features loaded - for details use TF.isLoaded()
  1.31s Mapping entityId (node)
  1.35s Mapping entityKind (node)
  0.00s Exporting 2 node and 0 edge and 0 config features to ~/github/CLARIAH/wp6-missieven/entities/tf/1.0:
   |     0.02s T entityId             to ~/github/CLARIAH/wp6-missieven/entities/tf/1.0
   |     0.02s T entityKind           to ~/github/CLARIAH/wp6-missieven/entities/tf/1.0
  0.05s Exported 2 node features and 0 edge features and 0 config features to ~/github/CLARIAH/wp6-missieven/entities/tf/1.0
  0.05s Done


## Check

Now let's check by loading the new version of the corpus with the migrated entities,
and perform the same query as in the [entities](entities.ipynb) notebook.

In [34]:
A = use(
    "CLARIAH/wp6-missieven:clone",
    checkout="clone",
    mod="CLARIAH/wp6-missieven/entities/tf:clone",
    hoist=globals(),
    version="1.0",
)

In [35]:
F.entityId.freqList()[0:20]

(('e_n12_2_632', 8),
 ('e_n13_15_2306', 8),
 ('e_n7_8_809', 8),
 ('e_n13_15_1302', 7),
 ('e_n7_8_1080', 7),
 ('e_t10_15_108', 7),
 ('e_t10_15_273', 7),
 ('e_n10_11_715', 6),
 ('e_n12_14_130', 6),
 ('e_n12_2_383', 6),
 ('e_n12_2_578', 6),
 ('e_n13_15_154', 6),
 ('e_n13_15_1582', 6),
 ('e_n13_15_1894', 6),
 ('e_n13_15_285', 6),
 ('e_n5_28_103', 6),
 ('e_n5_28_34', 6),
 ('e_n5_28_675', 6),
 ('e_n5_7_515', 6),
 ('e_n8_6_710', 6))

In [36]:
len(F.entityId.freqList())

24500

In [37]:
F.entityKind.freqList()

(('LOC', 12790),
 ('PER', 10393),
 ('LOCderiv', 4279),
 ('ORG', 3841),
 ('SHP', 2922),
 ('GPE', 1153),
 ('RELderiv', 261),
 ('ORGpart', 58),
 ('LOCpart', 45),
 ('RELpart', 28),
 ('REL', 19))

In [38]:
query = """
word entityId entityKind*
"""
results = A.search(query)

  1.77s 32249 results


In [39]:
A.show(results, condensed=True, end=10)

A nice correspondence indeed!

However, it is preferable that the third party repeats the entity recognition
on the new version of the corpus, so that the entities in volume 14 get recognized too.

---

# Contents

* **[start](start.ipynb)** start computing with this corpus
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[compute](compute.ipynb)** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **[entities](entities.ipynb)** use results of third-party NER (named entity recognition)
* **porting** port features made against an older version to a newer version
* **[volumes](volumes.ipynb)** work with selected volumes only

CC-BY Dirk Roorda