<img align="right" src="images/tf.png" width="200"/>
<img align="right" src="images/huc.png" width="200"/>
<img align="right" src="images/logo.png" width="200"/>

---

To get started: consult [start](start.ipynb)

---

# Named Entities

A research group has applied a Named Entity Recognition (NER) algorithm to this corpus and
delivered the results as Text-Fabric features in
[cltl/voc-missives](https://github.com/cltl/voc-missives).

We can use these shared features. They are in `export/tf`, and we see that they have been produced
against version `1.0` of the corpus data.

I have also copied these features to this repo, to `voc-missives/export/tf`, because
at the time of writing the features were still in a `dev` branch of the `cltl/voc-missives`
repo.

This is an example of the practice of using research results that others have shared.

Note that we do not have to make a manual effort to fetch the data and integrate it into the corpus.
We refer to the features by their location on GitHub, and Text-Fabric does the rest.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

from tf.app import use

Note that we use a specific version of the entities against a specific version of the corpus.
Sooner or later we'll see newer versions of either. 

It is no problem to load newer versions of entity features against a corpus that has been used to create those features.

But what to do if we have a newer and better version of the corpus, but no newer entity features?
In the [porting](porting.ipynb) notebook we'll show how we can carry over
the entity features from an older version to a newer version.

In [3]:
A = use(
    "CLARIAH/wp6-missieven",
    version="1.0",
    mod="CLARIAH/wp6-missieven/voc-missives/export/tf", # local copy of entities in this repo
    # mod="cltl/voc-missives/export/tf", # repo where the entities have been created
    hoist=globals(),
)

The requested data is not available offline
	~/text-fabric-data/github/CLARIAH/wp6-missieven/voc-missives/export/tf/1.0 not found
No directory voc-missives/export/tf/1.0 in #f6e4276d1c8f9f15923d54ccf2339698820b2948
Will try something else
	Failed

No directory voc-missives/export/tf/1.0 in #c2d32e4a7a00cde289584cf39c1b11de33e3f783
	Failed
There was an error loading TF-app clariah/wp6-missieven from ~/text-fabric-data/github/clariah/wp6-missieven/app
AttributeError("'TfApp' object has no attribute 'TF'")
Traceback (most recent call last):
  File "/Users/me/github/annotation/text-fabric/tf/advanced/app.py", line 546, in findApp
    app = appClass(
  File "/Users/me/text-fabric-data/github/clariah/wp6-missieven/app/app.py", line 51, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/me/github/annotation/text-fabric/tf/advanced/app.py", line 180, in __init__
    volumesApi(self)
  File "/Users/me/github/annotation/text-fabric/tf/advanced/volumes.py", line 39, in volumesApi
    TF = app.TF
AttributeError: 'TfApp' object has no attribute 'TF'
Text-Fabric is not loaded


When the data loads successfully, the feature list above contains a new section that you can expand to see
which features the module contributed.

Now, suppose we did not know much about these features; then we would like to do a few basic checks.

A good start is to inspect a frequency list of the values of the new features,
and then to perform a query looking for the nodes that have these features.
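
To make clear what such a frequency list contains, here is a minimal sketch in plain Python, independent of Text-Fabric. The function name `freq_list` and the toy values are hypothetical. The ordering (highest count first, ties broken by value) is an assumption based on the listings in this notebook, not a statement about how the library implements it.

```python
import collections


def freq_list(values):
    """Return (value, count) pairs, most frequent first.

    Hypothetical stand-in for a feature's frequency list;
    ties are broken by value (an assumption).
    """
    counts = collections.Counter(v for v in values if v is not None)
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


# Toy feature values for six words; None means "no value for this word".
toy_values = ["e_1", "e_1", None, "e_2", "e_1", "e_2"]
toy_freqs = freq_list(toy_values)
print(toy_freqs)
```

Note that such a list counts the words that carry a value, so a multi-word entity contributes one count per word.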

## Entities

First the identities:

In [6]:
F.entityId.freqList()[0:20]

(('e_t_1_3_1', 7),
 ('e_t_3_11_2', 7),
 ('e_t_5_26_27', 7),
 ('e_t_8_10_19', 7),
 ('e_t_8_13_21', 7),
 ('e_t_8_17_18', 7),
 ('e_t_8_19_18', 7),
 ('e_t_9_18_9', 7),
 ('e_n_11_24_34', 6),
 ('e_n_11_34_21', 6),
 ('e_n_12_4_20', 6),
 ('e_n_1_13_4', 6),
 ('e_n_1_59_45', 6),
 ('e_n_2_26_21', 6),
 ('e_n_2_7_12', 6),
 ('e_n_4_24_0', 6),
 ('e_n_6_38_34', 6),
 ('e_t_10_14_9', 6),
 ('e_t_10_15_9', 6),
 ('e_t_10_22_4', 6))

In [7]:
len(F.entityId.freqList())

28451

So tens of thousands of named entities have been detected, and the most frequent ids occur only 7 times!

Now the kinds of named entities:

In [8]:
F.entityKind.freqList()

(('PER', 18718),
 ('LOC', 14819),
 ('SHP', 7199),
 ('LOCderiv', 2916),
 ('ORG', 2584),
 ('RELderiv', 130))

We guess at the legend:

* `LOC` = location
* `PER` = person
* `ORG` = organisation
* `SHP` = ship
* `REL` = relationship (??)
* *xxx*`deriv` = derived from a *xxx*
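
The guessed legend can be written down as a small lookup, which is handy when labelling results. This is only a sketch: the names `KIND_LEGEND` and `describe_kind` are hypothetical, and the meanings are the guesses listed above (the one for `REL` in particular is uncertain), not documentation from the creators of the features.

```python
# Guessed legend from the list above; the meaning of REL is uncertain.
KIND_LEGEND = {
    "LOC": "location",
    "PER": "person",
    "ORG": "organisation",
    "SHP": "ship",
    "REL": "relationship (??)",
}
DERIV_SUFFIX = "deriv"


def describe_kind(kind):
    """Turn a kind label such as 'LOCderiv' into a readable description."""
    derived = kind.endswith(DERIV_SUFFIX)
    base = kind[: -len(DERIV_SUFFIX)] if derived else kind
    label = KIND_LEGEND.get(base, f"unknown kind {base}")
    return f"derived from a {label}" if derived else label


for kind in ("PER", "LOC", "SHP", "LOCderiv", "ORG", "RELderiv"):
    print(f"{kind:<9} {describe_kind(kind)}")
```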

We want to know what the identifier means:

* does it connect the separate words of one entity *occurrence*?
* does it connect the multiple occurrences of a single named entity throughout the text?

We'll test. 

We collect the entity occurrences. We represent an entity occurrence by a streak of words that carry the same entity id.

A streak is a sequence of words in which every word that has an entity id has the same one; words without an entity id may occur in between.

In [19]:
eidOccs = {}

curStreak = []
curId = None

for w in F.otype.s("word"):
    eid = F.entityId.v(w)
    
    if eid is None:
        continue
        
    if eid == curId:
        curStreak.append(w)
        continue
        
    if curId is not None:
        eidOccs.setdefault(curId, []).append(tuple(curStreak))
        
    curId = eid
    curStreak = [w]
    
if curId is not None:
    eidOccs.setdefault(curId, []).append(tuple(curStreak))
    
len(eidOccs)

28451
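
To see how this streak logic treats gaps and repeated ids, the same loop can be tried on toy data without the corpus. This is a sketch under assumptions: the function name `collect_occurrences` and the toy node numbers and ids are hypothetical, and the input is a plain list of (node, entity id) pairs instead of Text-Fabric features.

```python
def collect_occurrences(word_eids):
    """Group (word node, entity id) pairs into streaks per entity id.

    Words without an entity id (None) are skipped, so they do not
    break a streak; a different entity id does.
    """
    occurrences = {}
    cur_id = None
    cur_streak = []

    for (node, eid) in word_eids:
        if eid is None:
            continue
        if eid == cur_id:
            cur_streak.append(node)
            continue
        if cur_id is not None:
            occurrences.setdefault(cur_id, []).append(tuple(cur_streak))
        cur_id = eid
        cur_streak = [node]

    # flush the last streak
    if cur_id is not None:
        occurrences.setdefault(cur_id, []).append(tuple(cur_streak))
    return occurrences


# Toy data: node 3 has no entity id, node 6 starts a second streak of "a".
toy_words = [(1, "a"), (2, "a"), (3, None), (4, "a"), (5, "b"), (6, "a")]
toy_occs = collect_occurrences(toy_words)
print(toy_occs)
```

In the toy data, id `a` gets two occurrences: one streak that bridges the word without an id, and a second one after the streak of `b`.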

We only have to count the maximum number of occurrences that an entity id has, to decide what the id means.

In [20]:
max(len(occs) for occs in eidOccs.values())

3

That's interesting, let's make a count of the distribution.

In [21]:
eidFreq = collections.Counter()

for occs in eidOccs.values():
    eidFreq[len(occs)] += 1
    
eidFreq

Counter({1: 28411, 2: 37, 3: 3})
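
As a quick sanity check, this distribution can be tied back to the earlier counts. The variable names below are hypothetical; the numbers are copied from the outputs above.

```python
import collections

# Distribution reported above: number of occurrences -> number of entity ids.
eid_freq = collections.Counter({1: 28411, 2: 37, 3: 3})

# Every entity id falls in exactly one bucket.
n_ids = sum(eid_freq.values())
# Total number of occurrences over all ids.
n_occurrences = sum(n_occs * n_eids for (n_occs, n_eids) in eid_freq.items())
# Ids with more than one occurrence.
n_multi = sum(n_eids for (n_occs, n_eids) in eid_freq.items() if n_occs > 1)

print(n_ids, n_occurrences, n_multi)
```

The buckets add up to the 28451 distinct ids found earlier, and only 40 of those ids have more than one occurrence.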

The vast majority of entities have a single occurrence.

Let's inspect the ones with multiple occurrences.

We make a table of them and where they occur.

In [29]:
for (eid, occs) in eidOccs.items():
    if len(occs) < 2:
        continue
    
    table = []
    highlights = set()
    
    ekind = F.entityKind.v(occs[0][0])
    
    for occ in occs:
        firstWord = occ[0]
        lastWord = occ[-1]
        # highlight the word nodes of this occurrence (not the occurrence tuples)
        highlights |= set(occ)
        line = L.u(firstWord, "line")[0]
        table.append((line, *tuple(range(firstWord, lastWord + 1))))
    
    A.dm(f"""\n\n### `{ekind}` entity `{eid}` ({len(occs)}x)\n\n""")
    A.table(table, highlights=highlights, extraFeatures="entityId entityKind")
    # A.show(table, highlights=highlights, extraFeatures="entityId entityKind")



### `PER` entity `e_t_1_40_10` (2x)



n,p,line,word,word.1
1,1 97:2,,WILLEM,JANSZ
2,1 97:26,,KASTEEL,JAKATRA




### `LOC` entity `e_n_1_68_3` (2x)



n,p,line,word,word.1
1,1 234:9,,Wight.,
2,1 234:13,,Siërra,"Leone,"




### `PER` entity `e_n_1_89_17` (2x)



n,p,line,word,word.1,word.2,word.3
1,1 433:22,,Sebesi.,,,
2,1 433:23,,Cornelis,van,der,"Lijn,"




### `SHP` entity `e_t_1_99_32` (2x)



n,p,line,word,word.1,word.2
1,1 596:15,,Prins,Willem,I
2,1 596:19,,Soo,,




### `LOC` entity `e_t_2_20_39` (2x)



n,p,line,word
1,2 230:19,,Lamajuta
2,2 230:21,,"Cillebar,"




### `PER` entity `e_t_2_25_0` (2x)



n,p,line,word,word.1,word.2,word.3
1,2 267:1,,CORNELIS,VAN,DER,LIJN
2,2 267:4,,JOAN,MAETSUYKER,,




### `PER` entity `e_t_2_43_1` (2x)



n,p,line,word,word.1
1,2 640:1,,CAREL,RENIERS
2,2 640:4,,JOAN,"MAETSUYKER,"




### `PER` entity `e_t_3_12_29` (2x)



n,p,line,word,word.1
1,3 247:44,,Samado,
2,3 248:26,,Gene,Macassaren




### `PER` entity `e_t_4_12_9` (2x)



n,p,line,word,word.1,word.2
1,4 183:2,,CONSTANTIN,RANST,
2,4 183:4,,PIETER,VAN,"HOORN,"




### `PER` entity `e_n_4_12_9` (2x)



n,p,line,word,word.1
1,4 183:15,,Hurdt,de »
2,4 183:17,,Hurdt,




### `LOC` entity `e_t_4_20_34` (2x)



n,p,line,word
1,4 309:34,,Curnagel
2,4 309:36,,Corle;




### `PER` entity `e_t_4_26_18` (2x)



n,p,line,word,word.1
1,4 405:3,,FREDERIK,BENT
2,4 405:5,,BATAVIA,




### `PER` entity `e_n_4_29_29` (2x)



n,p,line,word
1,4 426:36,,jr.
2,4 426:41,,"Vrees,"




### `PER` entity `e_n_4_40_58` (2x)



n,p,line,word,word.1
1,4 496:32,,Krawang.,
2,4 496:35,,Sjech,"Jusuf,"




### `SHP` entity `e_t_4_45_21` (3x)



n,p,line,word,word.1
1,4 651:17,,Haarlem,
2,4 651:18,,Nieuw,Ceyt
3,4 651:19,,Amsterdam,




### `LOC` entity `e_t_4_45_28` (2x)



n,p,line,word
1,4 651:25,,Delft
2,4 651:27,,Hoorn




### `PER` entity `e_t_5_27_46` (2x)



n,p,line,word,word.1,word.2,word.3
1,5 736:29,,Bruyn,Jansz.,van,Scheeve
2,5 736:31,,"Nederlant,",,,




### `PER` entity `e_n_7_41_34` (2x)



n,p,line,word
1,7 693:16,,Heynsiusj
2,7 693:20,,Die




### `SHP` entity `e_n_8_5_8` (2x)



n,p,line,word
1,8 59:9,,de »
2,8 59:19,,Men




### `SHP` entity `e_n_8_5_9` (2x)



n,p,line,word
1,8 59:19,,de »
2,8 59:24,,Men




### `LOC` entity `e_t_9_10_29` (3x)



n,p,line,word
1,9 112:18,,Amboina
2,9 112:24,,Banda
3,9 112:30,,Hieruyt




### `LOC` entity `e_t_9_14_27` (2x)



n,p,line,word
1,9 227:26,,Timor
2,9 227:39,,Palembang




### `LOC` entity `e_t_9_17_29` (2x)



n,p,line,word
1,9 294:12,,Amboina
2,9 294:39,,Banda




### `LOC` entity `e_t_9_24_22` (2x)



n,p,line,word
1,9 427:10,,Amboina
2,9 427:42,,Banda




### `LOC` entity `e_t_9_28_29` (2x)



n,p,line,word
1,9 525:15,,Amboina
2,9 525:20,,Banda




### `LOC` entity `e_t_9_35_85` (2x)



n,p,line,word
1,9 652:40,,Banda
2,9 653:5,,Makassar




### `LOC` entity `e_t_9_39_23` (2x)



n,p,line,word
1,9 750:21,,Amboina
2,9 750:25,,Banda




### `LOC` entity `e_t_10_5_24` (2x)



n,p,line,word
1,10 112:18,,Amboina
2,10 112:25,,Banda




### `PER` entity `e_t_10_20_21` (2x)



n,p,line,word
1,10 807:19,,Amboina
2,10 807:37,,Dientwegen




### `LOC` entity `e_t_10_21_51` (2x)



n,p,line,word
1,10 858:1,,Amboina
2,10 858:10,,Banda




### `LOC` entity `e_t_11_2_15` (2x)



n,p,line,word,word.1
1,11 8:23,,Amboina,
2,11 8:28,,Ban,da




### `SHP` entity `e_n_11_27_17` (2x)



n,p,line,word
1,11 604:22,,Duynhoff
2,11 605:13,,Batavia




### `LOC` entity `e_t_11_28_30` (2x)



n,p,line,word
1,11 608:17,,Amboina
2,11 608:22,,Banda




### `LOC` entity `e_t_11_32_21` (2x)



n,p,line,word,word.1
1,11 735:9,,Amboina,
2,11 735:16,,Banda,Banda




### `LOC` entity `e_t_13_1_42` (2x)



n,p,line,word
1,13 1:36,,Palembang
2,13 2:1,,SlAM




### `PER` entity `e_t_13_4_32` (2x)



n,p,line,word
1,13 111:13,,Timor
2,13 111:20,,SlAM




### `LOC` entity `e_t_13_11_28` (2x)



n,p,line,word
1,13 345:10,,Amboina
2,13 345:35,,Banda




### `LOC` entity `e_t_13_14_26` (3x)



n,p,line,word
1,13 483:7,,Amboina
2,13 483:13,,Banda
3,13 483:28,,Makassar




### `LOC` entity `e_t_13_15_22` (2x)



n,p,line,word
1,13 501:7,,Amboina
2,13 501:35,,Vermits




### `LOC` entity `e_t_13_16_24` (2x)



n,p,line,word
1,13 620:18,,Djambi
2,13 620:21,,Djambi


Let's query all words that have an entity notation:

In [7]:
query = """
word entityId entityKind*
"""
results = A.search(query)

  1.60s 32249 results


Here we query all words where the `entityId` is present.
We also mention the `entityKind` feature, but with a `*` behind it.
That is a criterion that is always True, so this mention does not alter the result list.
But now the feature does occur in the query, and hence it will be shown in the results.
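
The same template can be narrowed with other feature conditions. The sketch below only builds the template strings; the dictionary `templates` and its keys are hypothetical, the value `PER` comes from the frequency list of `entityKind`, and the forms `feature=value` and `feature#` (feature has no value) are used as I understand the search template syntax. Running them requires the loaded corpus and a call such as `A.search(template)`.

```python
# Hypothetical variations on the template above; each is a plain string
# that would be passed to A.search() once the corpus is loaded.
templates = {
    # words that have an entity id and are marked as a person
    "persons": """
word entityId entityKind=PER
""",
    # words that have an entity id but no value for entityKind
    "id without kind": """
word entityId entityKind#
""",
}

for (name, template) in templates.items():
    print(f"{name}: {template.strip()}")
    # results = A.search(template)  # needs the corpus loaded via use()
```

If the second template yields results, some words carry an entity id without a kind, which would be worth investigating.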

In [8]:
A.show(results, condensed=True, end=10)

---

# Contents

* **[start](start.ipynb)** start computing with this corpus
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[compute](compute.ipynb)** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **entities** use results of third-party NER (named entity recognition)
* **[porting](porting.ipynb)** port features made against an older version to a newer version
* **[volumes](volumes.ipynb)** work with selected volumes only

CC-BY Dirk Roorda