<img align="right" src="images/tf.png" width="200"/>
<img align="right" src="images/huc.png" width="200"/>
<img align="right" src="images/logo.png" width="200"/>

---

To get started: consult [start](start.ipynb)

---

# Named Entities

A research group has applied a Named Entity Recognition (NER) algorithm to this corpus and
delivered the results as Text-Fabric features in
[cltl/voc-missives](https://github.com/cltl/voc-missives).

We can use these shared features. They are in `export/tf`, and we see that they have been produced
against version `1.0` of the corpus data.

This is an example of the practice of using research results that others have shared.

Note that we do not need any manual effort to get the data and integrate it into the corpus.
We refer to the features by their location on GitHub, and Text-Fabric does the rest.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

from tf.app import use

Note that we draw in an earlier version, in which there were still some encoding problems.
As time goes on, both the entities and the main corpus may evolve further.

If the main corpus gets ahead of the entities, we have ways to preserve the older entities against the newer corpus.
In the [porting](porting.ipynb) notebook we'll show how we can carry over
the entity features from an older version to a newer version of the main corpus.

**Temporary notice**

At the moment of writing, the entity features are present in the
[cltl/voc-missives](https://github.com/cltl/voc-missives)
repo, but only on the dev branch.
We have made a copy of these features in this repo, in `voc-missives/export/tf`.

In [3]:
A = use(
    "CLARIAH/wp6-missieven",
    version="1.0",
    mod="CLARIAH/wp6-missieven/voc-missives/export/tf:hot", # copy of entity data
    # mod="cltl/voc-missives/export/tf",
    hoist=globals(),
)

Above you see a new section in the feature list that you can expand to see
which features that module contributed.

Now, suppose we did not know much about these features. Then we would like to do a few basic checks.

A good start is to inspect a frequency list of the values of the new features,
and then to perform a query looking for the nodes that have these features.

## Entities

First the identities:

In [4]:
F.entityId.freqList()[0:20]

(('e_1_27_20', 7),
 ('e_1_89_31', 7),
 ('e_5_26_55', 7),
 ('e_8_10_20', 7),
 ('e_8_13_21', 7),
 ('e_8_17_18', 7),
 ('e_8_19_18', 7),
 ('e_9_18_10', 7),
 ('e_10_14_10', 6),
 ('e_10_15_9', 6),
 ('e_10_22_5', 6),
 ('e_10_23_4', 6),
 ('e_10_4_11', 6),
 ('e_12_18_9', 6),
 ('e_12_19_10', 6),
 ('e_12_20_9', 6),
 ('e_1_13_7', 6),
 ('e_1_24_19', 6),
 ('e_2_13_43', 6),
 ('e_2_17_23', 6))

In [5]:
len(F.entityId.freqList())

17746

So more than 17,000 entity ids have been detected, and the most frequent ones occur only 7 times!
Because of that we expect that no coreference analysis has been conducted, but that each entity occurrence
has its own id. When an entity occurrence consists of multiple words, we expect to see the same entity id repeated over those words.
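To make this expectation concrete, here is a minimal sketch on made-up data (the word list and the ids are hypothetical, not taken from the corpus). If every occurrence has its own id, the frequency of an id is simply the number of words in that occurrence.

```python
import collections

# Hypothetical toy data: (word, entity id) pairs; None means "no entity".
toyWords = [
    ("Pieter", "e_1"),
    ("Both", "e_1"),
    ("zeilt", None),
    ("naar", None),
    ("Banda", "e_2"),
]

# Count how many words carry each entity id, skipping words without one.
idFreq = collections.Counter(eid for (word, eid) in toyWords if eid is not None)

# Sort by descending frequency, then by id, in the spirit of a frequency list.
freqList = tuple(sorted(idFreq.items(), key=lambda x: (-x[1], x[0])))
print(freqList)
```

In this toy case the two-word name gets frequency 2 and the one-word name gets frequency 1, so low maximum frequencies are what we would see without coreference analysis.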

Now the kinds of named entities:

In [6]:
F.entityKind.freqList()

(('PER', 15701),
 ('LOC', 7204),
 ('SHP', 5359),
 ('ORG', 1271),
 ('LOCderiv', 1239),
 ('RELderiv', 42))

We guess at the legend:

* `LOC` = location
* `PER` = person
* `ORG` = organisation
* `SHP` = ship
* `REL` = relationship (??)
* *xxx*`deriv` = derived from a *xxx*

We want to know what the identifier means:

* does it connect the separate words of one entity *occurrence*?
* does it connect the multiple occurrences of a single named entity throughout the text?

We'll test. 

We collect the entity occurrences. We define an entity occurrence as a streak of words that carry the same entity id, and we store it as the tuple of its word nodes.

A streak is a maximal sequence of consecutive words that all have the same entity id. A word without an entity id, or with a different one, ends the streak.

In [7]:
eidOccs = {}

curStreak = []
curId = None

def addEntity():
    eidOccs.setdefault(curId, []).append(tuple(curStreak))

for w in F.otype.s("word"):
    eid = F.entityId.v(w)
    
    if eid is None:
        if curId is not None:
            addEntity()
            curId = None
        continue
        
    if eid == curId:
        curStreak.append(w)
        continue
        
    if curId is not None:
        addEntity()
        
    curId = eid
    curStreak = [w]
    
if curId is not None:
    eidOccs.setdefault(curId, []).append(tuple(curStreak))
    
len(eidOccs)

17746
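The same grouping can be sketched with `itertools.groupby`, which splits a sequence into maximal runs with an equal key. The snippet below is an illustration on hypothetical toy data (the `toyEntityId` mapping stands in for `F.entityId.v`); it is not the code that produced the number above.

```python
import itertools

# Hypothetical toy data: word node -> entity id (None means "no entity").
toyEntityId = {1: "e_1", 2: "e_1", 3: None, 4: "e_1", 5: "e_2", 6: None}

toyOccs = {}

# groupby yields one run per maximal stretch of nodes with the same entity id.
for (eid, run) in itertools.groupby(sorted(toyEntityId), key=toyEntityId.get):
    if eid is None:
        continue  # runs of words without an entity id are not occurrences
    toyOccs.setdefault(eid, []).append(tuple(run))

print(toyOccs)
```

Here `e_1` ends up with two occurrences, because the word without an entity id at node 3 interrupts it.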

We only have to count the maximum number of occurrences that an entity id has, to decide what the id means.

In [8]:
max(len(occs) for occs in eidOccs.values())

2

That's interesting, let's make a count of the distribution.

In [9]:
eidFreq = collections.Counter()

for occs in eidOccs.values():
    eidFreq[len(occs)] += 1
    
eidFreq

Counter({1: 17736, 2: 10})
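As a side note, `Counter` also accepts an iterable directly, so the tally can be written in one expression. A small sketch on hypothetical toy data in the same shape as `eidOccs`:

```python
import collections

# Hypothetical toy data: entity id -> list of occurrences (tuples of word nodes).
toyOccs = {"e_1": [(1, 2), (4,)], "e_2": [(5,)], "e_3": [(8, 9, 10)]}

# Tally how many entity ids have 1 occurrence, 2 occurrences, and so on.
toyFreq = collections.Counter(len(occs) for occs in toyOccs.values())

# The ids with more than one occurrence are the ones worth inspecting by hand.
toyMultiple = sorted(eid for (eid, occs) in toyOccs.items() if len(occs) > 1)

print(toyFreq, toyMultiple)
```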

The vast majority of entities has a single occurrence.

Let's inspect the ones with multiple occurrences.

We make a table of them and where they occur.

In [10]:
example = 0
for (eid, occs) in eidOccs.items():
    if len(occs) < 2:
        continue
    
    table = []
    highlights = set()
    
    ekind = F.entityKind.v(occs[0][0])
    
    for occ in occs:
        firstWord = occ[0]
        lastWord = occ[-1]
        highlights |= set(occ)
        line = L.u(firstWord, "line")[0]
        table.append((line, *tuple(range(firstWord, lastWord + 1))))
    
    example += 1
    A.dm(f"""\n\n### Example {example}: `{ekind}` entity `{eid}` ({len(occs)}x)\n\n""")
    A.table(table, highlights=highlights, extraFeatures="entityId entityKind")
    # A.show(table, highlights=highlights, extraFeatures="entityId entityKind")



### Example 1: `LOC` entity `e_1_68_26` (2x)



| n | p | line | word | |
| --- | --- | --- | --- | --- |
| 1 | 1 234:9 | | Wight. | |
| 2 | 1 234:13 | | Siërra | Leone, |




### Example 2: `PER` entity `e_1_89_31` (2x)



| n | p | line | word | | | |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 433:22 | | Sebesi. | | | |
| 2 | 1 433:23 | | Cornelis | van | der | Lijn, |




### Example 3: `LOC` entity `e_2_20_40` (2x)



| n | p | line | word |
| --- | --- | --- | --- |
| 1 | 2 230:19 | | Lamajuta |
| 2 | 2 230:21 | | Cillebar, |




### Example 4: `LOC` entity `e_4_9_27` (2x)



| n | p | line | word |
| --- | --- | --- | --- |
| 1 | 4 125:11 | | Thiel |
| 2 | 4 125:12 | | bharen . |




### Example 5: `LOC` entity `e_4_12_33` (2x)



| n | p | line | word |
| --- | --- | --- | --- |
| 1 | 4 183:24 | | Lima |
| 2 | 4 183:25 | | aff |




### Example 6: `SHP` entity `e_4_45_31` (2x)



| n | p | line | word | word |
| --- | --- | --- | --- | --- |
| 1 | 4 651:18 | | Nieuw | Ceyt |
| 2 | 4 651:19 | | Amsterdam | |




### Example 7: `PER` entity `e_5_16_17` (2x)



| n | p | line | word |
| --- | --- | --- | --- |
| 1 | 5 434:4 | | Copie |
| 2 | 5 434:6 | | Het |




### Example 8: `SHP` entity `e_9_20_42` (2x)



| n | p | line | word |
| --- | --- | --- | --- |
| 1 | 9 344:15 | | Haemstede ^ |
| 2 | 9 344:16 | | Heynkensant > |




### Example 9: `SHP` entity `e_9_20_43` (2x)



| n | p | line | word |
| --- | --- | --- | --- |
| 1 | 9 344:17 | | Schuytwijk ' |
| 2 | 9 344:18 | | Rijxdorff |




### Example 10: `PER` entity `e_9_20_53` (2x)



| n | p | line | word | | | |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 9 344:26 | | Cornelia ' | | | |
| 2 | 9 344:27 | | ’T | Casteel | van | Woerden |


You can click on the passage indicators above and see a facsimile of the page.
After inspecting all these cases, it appears that each of them falls into one of these categories:

1. last word of a footnote plus first words of the next footnote (example 1)
1. last word of a footnote plus first words of the following main text (example 2)
1. word in main text, followed by a footnote, followed by a word in the main text (examples 3-6)
1. first word in a folio reference plus first word in main text after that folio ref (example 7)
1. words in a table cell that spans several rows (example 8)
1. words in two table cells in two rows but the same column (examples 9, 10)

Let's query all words that have an entity notation:

In [11]:
query = """
word entityId entityKind*
"""
results = A.search(query)

  1.48s 29159 results


Here we query all words where the `entityId` is present.
We also mention the `entityKind` feature, followed by a `*`.
That is a criterion that is always true, so this mention does not alter the result list.
But now the feature does occur in the query, and hence it will be shown in the results.
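The effect of the two criteria can be mimicked in plain Python. This is only a sketch of the semantics described above, on hypothetical toy data; it says nothing about how Text-Fabric implements search internally.

```python
# Hypothetical toy data: word node -> feature values.
toyFeatures = {
    1: {"entityId": "e_1", "entityKind": "PER"},
    2: {"entityId": "e_2"},    # has an id, but no kind
    3: {"entityKind": "LOC"},  # has a kind, but no id
    4: {},
}


def matches(features):
    hasId = features.get("entityId") is not None  # `entityId`: a value must be present
    anyKind = True  # `entityKind*`: satisfied whatever the value is, even if absent
    return hasId and anyKind


toyResults = [w for (w, features) in toyFeatures.items() if matches(features)]
print(toyResults)
```

Only the presence of `entityId` decides whether a word is in the results; the kind, present or not, has no influence.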

In [12]:
A.show(results, condensed=True, end=10)

# Observations

We see some glitches.

Lines 1, 5, 7: `Amsterdam` is correctly marked as ship, but the full name is `Het Wapen van Amsterdam`.

Lines 2, 4: `De Mayo` is correctly marked as location in line 2, but the full name is `Ile de Mayo`,
and line 4 has this right.

Line 8: `ter rede van Mauritius` is incorrectly marked as a person; it is a location.

Line 9: `Pieter Both` is now incorrectly marked as two entities, a ship and a person,
while in lines 1, 5, 7 it was correctly marked as a single person entity.

Let's find all occurrences of Pieter Both and see their entity markings.

The following query does it:

In [13]:
query = """
word trans~(?i:Pieter) entityId* entityKind*
<: word trans~(?i:Both)
"""

results = A.search(query)

  9.71s 22 results
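For readers new to search templates, the logic of this query can be approximated in plain Python. The sketch below works on a hypothetical list of consecutive words and assumes that the regular expression may match anywhere in the value (hence `re.search`) and that `<:` means "immediately followed by".

```python
import re

# Hypothetical toy data: the `trans` values of consecutive word nodes.
toyTrans = ["Den", "gouverneur", "PIETER", "Both", "en", "Pieter", "de", "Both"]

# (?i:...) switches on case-insensitive matching for the enclosed pattern only.
firstRe = re.compile(r"(?i:Pieter)")
secondRe = re.compile(r"(?i:Both)")

# Collect index pairs (i, i + 1) where word i matches the first pattern
# and the word right after it matches the second pattern.
toyPairs = [
    (i, i + 1)
    for i in range(len(toyTrans) - 1)
    if firstRe.search(toyTrans[i]) and secondRe.search(toyTrans[i + 1])
]
print(toyPairs)
```

The pair `Pieter`, `de` is not a result, because the word that follows `Pieter` must itself match `Both`.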


This is a moment to try out the Text-Fabric browser.

At the command prompt, give the command

```
text-fabric CLARIAH/wp6-missieven --mod=dirkroorda/voc-missives/export/tf
```

and then paste the above query in the search box and press the search icon. You'll see a table of results
which you can expand in order to inspect the results more carefully.

You can also jump from the Text-Fabric browser to the facsimiles.
Everything is at your fingertips.

Below you see that result 4 is the only strange result.

In [14]:
A.show(results)

Let's now make an overview of words that receive different entity kinds in different occurrences.
We restrict ourselves to words that start with a capital letter.

In [15]:
wordKind = collections.defaultdict(lambda: collections.defaultdict(list))

for w in F.otype.s("word"):
    trans = F.trans.v(w)
    if not trans:
        continue
    if trans[0].lower() == trans[0]:
        continue
    transNorm = trans[0].upper() + trans[1:].lower()
    eKind = F.entityKind.v(w)
    wordKind[transNorm][eKind].append(w)
    
len(wordKind)

54018
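The filtering and normalization step in the loop above can be isolated in a small function, which makes it easy to see which words are kept. The function name `normalize` and the sample words are hypothetical, chosen for illustration only.

```python
def normalize(trans):
    """Return the capitalized form of a word, or None if the word is skipped."""
    if not trans:
        return None  # empty or missing transcription
    if trans[0].lower() == trans[0]:
        return None  # first character is not an uppercase letter
    return trans[0].upper() + trans[1:].lower()


toySamples = ["AMSTERDAM", "Amsterdam", "amsterdam", "’T", "1640", ""]
toyNorm = [normalize(t) for t in toySamples]
print(toyNorm)
```

Note that words starting with a digit or a punctuation mark are skipped as well, since lowercasing leaves such characters unchanged.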

How many of these words do not have any entity kind at all?

In [16]:
noEntity = []
hasEntity = {}

for (word, wordData) in wordKind.items():
    hasEnt = True
    if len(wordData) == 1:
        if None in wordData:
            noEntity.append((word, wordData[None]))
            hasEnt = False
    if hasEnt:
        hasEntity[word] = wordData
            
print(f"{len(noEntity):>5} words without entity occurrences")
print(f"{len(hasEntity):>5} words with entity occurrences")

49821 words without entity occurrences
 4197 words with entity occurrences


More than 90%!

Probably these are all words at the beginning of a sentence. Let's list the most frequent of them.

In [17]:
sortedNoEntity = sorted(noEntity, key=lambda x: (-len(x[1]), x[0]))
for (word, occs) in sortedNoEntity[0:50]:
    print(f"{word:<20} {len(occs):>5}x")

U                     4972x
Ed                    3752x
Ib                    3218x
Er                    3162x
Fol                   2370x
Dit                   2320x
Radja                 2139x
Hoogagtb              1889x
Edele                 1600x
Vorst                 1423x
Voor                  1347x
Hij                   1339x
Zijn                  1326x
Wel                   1318x
Ii                    1305x
Een                   1291x
Bij                   1168x
Sijn                  1095x
Dat                   1083x
Om                    1081x
Ook                   1045x
Uit                   1009x
Als                    981x
Coninck                961x
Wij                    884x
Sultan                 841x
Door                   808x
Daghregisters          791x
Tot                    707x
Daar                   693x
Volgens                678x
Daarom                 653x
Dog                    646x
Corpus                 630x
Nagapattinam           629x
Opkomst             

Let's also list the least frequent of them:

In [18]:
for (word, occs) in sortedNoEntity[-50:]:
    print(f"{word:<20} {len(occs):>5}x")

Zuyderpoeyer             1x
Zuydoostereilanden       1x
Zuydoosters              1x
Zuydtlandt               1x
Zuydtzee                 1x
Zuydwestereilanden       1x
Zuydwijck                1x
Zuyhn                    1x
Zuykermaelder            1x
Zuykermoolen             1x
Zuytbeveland             1x
Zuytcerammers            1x
Zuythollant              1x
Zuytlandts               1x
Zuytwijck                1x
Zu’1                     1x
Zu’l                     1x
Zwaardecroon’s           1x
Zwaardveger              1x
Zwaarmoedigheid          1x
Zwaegh                   1x
Zwakke                   1x
Zwanenburg               1x
Zwardecroon              1x
Zwart                    1x
Zwarts                   1x
Zwavel                   1x
Zweep                    1x
Zwervende                1x
Zwervens                 1x
Zwerver                  1x
Zwervers                 1x
Zwet                     1x
Zwid                     1x
Zwier                    1x
Zyan                

Clearly, it might pay off to inspect this list for missed named entities, especially in the regions of lesser frequency.

Back to the words that have been recognized as an entity. Let's divide them into several
classes:

A: words all of whose occurrences are an entity

A1: A-words that have 1 entity kind

A2: A-words that have 2 entity kinds

etc.

B: words that have occurrences that are an entity and occurrences that are not an entity

B1: B-words that have 1 entity kind

B2: B-words that have 2 entity kinds

etc.

In [19]:
classified = collections.defaultdict(list)

for (word, wordData) in hasEntity.items():
    n = len(wordData)
    if None in wordData:
        n -= 1
        clsMain = "B"
    else:
        clsMain = "A"
    cls = f"{clsMain}{n}"
    classified[cls].append(word)

In [20]:
for cls in sorted(classified):
    print(f"{cls}: {len(classified[cls]):>5} words")

A1:   750 words
A2:     3 words
B1:  3113 words
B2:   281 words
B3:    45 words
B4:     5 words
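To see the labelling rule at work, here is a minimal sketch that applies it to a few hand-made entries. The function name `classify` and the node numbers are hypothetical; the entries only mimic the shape of the values in `hasEntity`.

```python
def classify(wordData):
    """Return the class label for a mapping: entity kind (or None) -> occurrences."""
    nKinds = len(wordData)
    if None in wordData:
        # Some occurrences are not an entity: class B, and None is not a kind.
        return f"B{nKinds - 1}"
    return f"A{nKinds}"


# Hypothetical toy data, shaped like the values of hasEntity.
toyData = {
    "Labetaka": {"LOC": [10]},
    "Medicis": {"PER": [20], "SHP": [21]},
    "Wapen": {None: [30, 31], "SHP": [32]},
    "Kaap": {None: [40], "LOC": [41], "LOCderiv": [42], "SHP": [43]},
}

toyClasses = {word: classify(data) for (word, data) in toyData.items()}
print(toyClasses)
```

The number in the label always counts real entity kinds only, so a word in class B3 has three kinds plus occurrences without any kind.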


Let's inspect some cases in every category.

We see a lot of good things, especially in the A1 category, but also numerous
rough edges in the remaining categories.

In [21]:
for (cls, words) in sorted(classified.items()):
    nWords = len(words)
    pluralw = "" if nWords == 1 else "s"
    A.dm(f"### {cls}: {nWords} word{pluralw}\n\n")
    nWord = 0
    for word in words:
        nWord += 1
        wordData = hasEntity[word]
        for (kind, occs) in sorted(wordData.items(), key=lambda x: x[0] or ""):
            plural = "" if len(occs) == 1 else "s"
            A.dm(f"#### {cls}: case {nWord}: {word} as {kind}: {len(occs)} occurrence{plural}\n\n")
            A.show(tuple((occ,) for occ in occs), end=1)
        if nWord >= 5:
            break
    if nWords > 5:
        A.dm(f"#### {nWords - 5} more cases")

### A1: 750 words



#### A1: case 1: Verdische as LOC: 2 occurrences



#### A1: case 2: Labatacque as LOC: 1 occurrence



#### A1: case 3: Labetaka as LOC: 1 occurrence



#### A1: case 4: Amboynoysen as LOCderiv: 1 occurrence



#### A1: case 5: Romulus as LOC: 1 occurrence



#### 745 more cases

### A2: 3 words



#### A2: case 1: Medicis as PER: 1 occurrence



#### A2: case 1: Medicis as SHP: 1 occurrence



#### A2: case 2: Prinseneilanden as LOC: 4 occurrences



#### A2: case 2: Prinseneilanden as SHP: 1 occurrence



#### A2: case 3: Namrud as LOC: 1 occurrence



#### A2: case 3: Namrud as PER: 2 occurrences



### B1: 3113 words



#### B1: case 1: Aan as None: 958 occurrences



#### B1: case 1: Aan as PER: 5 occurrences



#### B1: case 2: Wapen as None: 150 occurrences



#### B1: case 2: Wapen as SHP: 23 occurrences



#### B1: case 3: Ile as None: 2 occurrences



#### B1: case 3: Ile as LOC: 1 occurrence



#### B1: case 4: Mayo as None: 2 occurrences



#### B1: case 4: Mayo as LOC: 3 occurrences



#### B1: case 5: Heren as None: 2898 occurrences



#### B1: case 5: Heren as ORG: 147 occurrences



#### 3108 more cases

### B2: 281 words



#### B2: case 1: I as None: 916 occurrences



#### B2: case 1: I as ORG: 1 occurrence



#### B2: case 1: I as SHP: 1 occurrence



#### B2: case 2: Pieter as None: 1195 occurrences



#### B2: case 2: Pieter as PER: 371 occurrences



#### B2: case 2: Pieter as SHP: 2 occurrences



#### B2: case 3: Both as None: 11 occurrences



#### B2: case 3: Both as LOC: 1 occurrence



#### B2: case 3: Both as PER: 35 occurrences



#### B2: case 4: Van as None: 5252 occurrences



#### B2: case 4: Van as PER: 1587 occurrences



#### B2: case 4: Van as SHP: 1 occurrence



#### B2: case 5: Indië as None: 593 occurrences



#### B2: case 5: Indië as LOC: 22 occurrences



#### B2: case 5: Indië as ORG: 2 occurrences



#### 276 more cases

### B3: 45 words



#### B3: case 1: Het as None: 8661 occurrences



#### B3: case 1: Het as LOC: 2 occurrences



#### B3: case 1: Het as PER: 1 occurrence



#### B3: case 1: Het as SHP: 1 occurrence



#### B3: case 2: Kaap as None: 1046 occurrences



#### B3: case 2: Kaap as LOC: 144 occurrences



#### B3: case 2: Kaap as LOCderiv: 1 occurrence



#### B3: case 2: Kaap as SHP: 2 occurrences



#### B3: case 3: Ter as None: 380 occurrences



#### B3: case 3: Ter as LOCderiv: 1 occurrence



#### B3: case 3: Ter as PER: 6 occurrences



#### B3: case 3: Ter as SHP: 4 occurrences



#### B3: case 4: Mauritius as None: 254 occurrences



#### B3: case 4: Mauritius as LOC: 23 occurrences



#### B3: case 4: Mauritius as PER: 8 occurrences



#### B3: case 4: Mauritius as SHP: 2 occurrences



#### B3: case 5: Kasteel as None: 156 occurrences



#### B3: case 5: Kasteel as LOC: 7 occurrences



#### B3: case 5: Kasteel as PER: 4 occurrences



#### B3: case 5: Kasteel as SHP: 26 occurrences



#### 40 more cases

### B4: 5 words



#### B4: case 1: Amsterdam as None: 1238 occurrences



#### B4: case 1: Amsterdam as LOC: 123 occurrences



#### B4: case 1: Amsterdam as ORG: 3 occurrences



#### B4: case 1: Amsterdam as PER: 7 occurrences



#### B4: case 1: Amsterdam as SHP: 17 occurrences



#### B4: case 2: De as None: 24227 occurrences



#### B4: case 2: De as LOC: 2 occurrences



#### B4: case 2: De as ORG: 1 occurrence



#### B4: case 2: De as PER: 572 occurrences



#### B4: case 2: De as SHP: 7 occurrences



#### B4: case 3: Banda as None: 2392 occurrences



#### B4: case 3: Banda as LOC: 176 occurrences



#### B4: case 3: Banda as LOCderiv: 1 occurrence



#### B4: case 3: Banda as PER: 1 occurrence



#### B4: case 3: Banda as SHP: 1 occurrence



#### B4: case 4: Groot as None: 417 occurrences



#### B4: case 4: Groot as LOC: 2 occurrences



#### B4: case 4: Groot as ORG: 1 occurrence



#### B4: case 4: Groot as PER: 17 occurrences



#### B4: case 4: Groot as SHP: 1 occurrence



#### B4: case 5: Hoorn as None: 547 occurrences



#### B4: case 5: Hoorn as LOC: 45 occurrences



#### B4: case 5: Hoorn as ORG: 1 occurrence



#### B4: case 5: Hoorn as PER: 169 occurrences



#### B4: case 5: Hoorn as SHP: 17 occurrences



---

# Contents

* **[start](start.ipynb)** start computing with this corpus
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[compute](compute.ipynb)** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **entities** use results of third-party NER (named entity recognition)
* **[porting](porting.ipynb)** port features made against an older version to a newer version
* **[volumes](volumes.ipynb)** work with selected volumes only

CC-BY Dirk Roorda