<img align="right" src="images/tf.png" width="200"/>
<img align="right" src="images/huc.png" width="200"/>
<img align="right" src="images/logo.png" width="200"/>

---

To get started: consult [start](start.ipynb)

---

# Named Entities

A research group has applied a Named Entity Recognition (NER) algorithm to this corpus and
delivered the results as Text-Fabric features in
[cltl/voc-missives](https://github.com/cltl/voc-missives).

We can use these shared features: they are in `export/tf`, and we see that they have been produced
against version `0.8.1` of the corpus data.

This is an example of the practice of using research results that others have shared.

Note that we do not have to make a manual effort to get the data and integrate it into the corpus.
We refer to the features by their location on GitHub, and Text-Fabric does the rest.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import collections
import os

from tf.app import use

Note that we draw in an earlier version of the corpus, which still had some encoding problems.
In later versions these problems are gone.

In a separate notebook we'll show how we can carry over the entity features from an older version to a newer version.

In [4]:
A = use(
    "clariah/wp6-missieven",
    version="0.8.1",
    mod="dirkroorda/voc-missives/export/tf", # using my fork until a pull-request has been accepted
    # mod="cltl/voc-missives/export/tf",
    hoist=globals(),
)

The requested data is not available offline
	~/text-fabric-data/github/dirkroorda/voc-missives/export/tf/0.8.1 not found
	cannot find releases
	cannot find releases




Above you see a new section in the feature list that you can expand to see
which features that module contributed.

Now, suppose we did not know much about these features; then we would like to do a few basic checks.

A good start is to inspect a frequency list of the values of the new features,
and then to perform a query looking for the nodes that have these features.

## Entities

First the identities:

In [6]:
F.entityId.freqList()[0:20]

(('e_n12_2_632', 8),
 ('e_n13_15_2306', 8),
 ('e_n7_8_809', 8),
 ('e_n13_15_1302', 7),
 ('e_n7_8_1080', 7),
 ('e_t10_15_108', 7),
 ('e_t10_15_273', 7),
 ('e_n10_11_715', 6),
 ('e_n12_14_130', 6),
 ('e_n12_2_383', 6),
 ('e_n12_2_578', 6),
 ('e_n13_15_154', 6),
 ('e_n13_15_1582', 6),
 ('e_n13_15_1894', 6),
 ('e_n13_15_285', 6),
 ('e_n5_28_103', 6),
 ('e_n5_28_34', 6),
 ('e_n5_28_675', 6),
 ('e_n5_7_515', 6),
 ('e_n8_6_710', 6))

In [8]:
len(F.entityId.freqList())

24500

So tens of thousands of named entities have been detected, and the most frequent ones occur only 8 times!
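
To see how thinly the identifiers are spread, it can help to summarise a frequency list instead of only reading off its top.
The following is a minimal sketch on the twenty pairs shown above, copied into a plain tuple so that it runs without the corpus.
The helper `summarize` is a hypothetical name, not part of Text-Fabric; in the notebook you would pass `F.entityId.freqList()` to it instead.

```python
import collections

# The top of the frequency list shown above, copied by hand so that the
# sketch runs without the corpus. In the notebook: F.entityId.freqList()
top = (
    ("e_n12_2_632", 8),
    ("e_n13_15_2306", 8),
    ("e_n7_8_809", 8),
    ("e_n13_15_1302", 7),
    ("e_n7_8_1080", 7),
    ("e_t10_15_108", 7),
    ("e_t10_15_273", 7),
    ("e_n10_11_715", 6),
    ("e_n12_14_130", 6),
    ("e_n12_2_383", 6),
    ("e_n12_2_578", 6),
    ("e_n13_15_154", 6),
    ("e_n13_15_1582", 6),
    ("e_n13_15_1894", 6),
    ("e_n13_15_285", 6),
    ("e_n5_28_103", 6),
    ("e_n5_28_34", 6),
    ("e_n5_28_675", 6),
    ("e_n5_7_515", 6),
    ("e_n8_6_710", 6),
)


def summarize(freq_list):
    """Hypothetical helper: summarise a sequence of (value, count) pairs.

    Returns the number of distinct values, the total number of occurrences,
    and a histogram that maps each count to the number of values having it.
    """
    distinct = len(freq_list)
    total = sum(n for (_, n) in freq_list)
    histogram = collections.Counter(n for (_, n) in freq_list)
    return distinct, total, histogram


distinct, total, histogram = summarize(top)
print(f"{distinct} identifiers, {total} occurrences")
for n, count in sorted(histogram.items(), reverse=True):
    print(f"{count:>3} identifiers occur {n} times")
```

Applied to the full list, the histogram shows at a glance how many identifiers occur only once or twice.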

Now the kinds of named entities:

In [9]:
F.entityKind.freqList()

(('LOC', 12786),
 ('PER', 10392),
 ('LOCderiv', 4276),
 ('ORG', 3841),
 ('SHP', 2922),
 ('GPE', 1153),
 ('RELderiv', 261),
 ('ORGpart', 58),
 ('LOCpart', 45),
 ('RELpart', 28),
 ('REL', 19))

We guess at the legend:

* `LOC` = location
* `PER` = person
* `ORG` = organisation
* `SHP` = ship
* `GPE` = geo-political entity, the usual reading of this tag in NER (??)
* `REL` = relationship (??)
* *xxx*`deriv` = derived from a *xxx*
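
If the guessed legend holds, the suffixed kinds can be folded into their base kinds to get a coarser picture.
The sketch below does this for the counts listed above, copied into a plain tuple so that it runs without the corpus.
Treating `deriv` and `part` as suffixes on a base kind is an assumption that follows the guess above, and `base_kind` is a hypothetical helper, not part of Text-Fabric.

```python
import collections

# The frequency list of entityKind shown above, copied by hand.
# In the notebook: F.entityKind.freqList()
kinds = (
    ("LOC", 12786),
    ("PER", 10392),
    ("LOCderiv", 4276),
    ("ORG", 3841),
    ("SHP", 2922),
    ("GPE", 1153),
    ("RELderiv", 261),
    ("ORGpart", 58),
    ("LOCpart", 45),
    ("RELpart", 28),
    ("REL", 19),
)

# Assumption: these suffixes mark a variant of a base kind.
SUFFIXES = ("deriv", "part")


def base_kind(kind):
    """Hypothetical helper: strip a known suffix, e.g. LOCderiv -> LOC."""
    for suffix in SUFFIXES:
        if kind.endswith(suffix):
            return kind[: -len(suffix)]
    return kind


# Add up the counts per base kind.
folded = collections.Counter()
for kind, n in kinds:
    folded[base_kind(kind)] += n

total = sum(folded.values())
for kind, n in folded.most_common():
    print(f"{kind:<4} {n:>6} {100 * n / total:5.1f}%")
```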

Let's query all words that have an entity notation:

In [10]:
query = """
word entityId entityKind*
"""
results = A.search(query)

  1.19s 32249 results


Here we query all words for which `entityId` has a value.
We also mention the `entityKind` feature, but with a `*` after it.
That criterion is always true, so mentioning it does not alter the result list.
But now the feature occurs in the query, and hence it will be shown in the results.
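
The difference between a bare feature name and a name followed by `*` can be made concrete with a toy model.
The sketch below is not how Text-Fabric evaluates search templates; it only mimics the two criteria used above on a handful of made-up word records.
The helpers `has_value`, `always_true` and `search` are hypothetical names introduced for this illustration.

```python
# Made-up word records; a feature that is absent has the value None.
words = [
    {"text": "Batavia", "entityId": "e1", "entityKind": "LOC"},
    {"text": "ende", "entityId": None, "entityKind": None},
    {"text": "Coen", "entityId": "e2", "entityKind": "PER"},
    {"text": "schip", "entityId": None, "entityKind": None},
]


def has_value(feature):
    """Criterion for a bare feature name: the feature must have a value."""
    return lambda word: word.get(feature) is not None


def always_true(feature):
    """Criterion for `feature*`: holds for every record, whatever the value."""
    return lambda word: True


def search(records, criteria):
    """Keep the records that satisfy all criteria."""
    return [w for w in records if all(test(w) for test in criteria)]


without_star = search(words, [has_value("entityId")])
with_star = search(words, [has_value("entityId"), always_true("entityKind")])

print([w["text"] for w in with_star])
print(with_star == without_star)
```

Both searches select the same records; the only effect of the extra criterion in a real template is that the feature is named, so the display knows to show it.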

In [11]:
A.show(results, condensed=True, end=10)

---

# Contents

* **[start](start.ipynb)** start computing with this corpus
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[compute](compute.ipynb)** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **entities** use results of third-party NER (named entity recognition)
* **[volumes](volumes.ipynb)** work with selected volumes only

CC-BY Dirk Roorda