# Named Entities

A research group at VU University Amsterdam (Piek Vossen, Sophie Arnoult)
has applied a Named Entity Recognition (NER) algorithm to this corpus and
delivered the results as Text-Fabric features in
[cltl/voc-missives](https://github.com/cltl/voc-missives).

We can use these shared features: they are in `export/tf`, and they have been produced
against version `1.0` of the corpus data.

In this notebook we interpret the named entities further.
First, we collect the words that are marked as part of a named entity into continuous streaks, and we
create `ent` nodes for them. These are the entity occurrences.
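
As an illustration of the streak idea, here is a minimal sketch in plain Python. It is not the code in `addEntities`; it assumes that each word slot carries an entity identifier and kind (the run log further down mentions word features named `entityId` and `entityKind`), and that a streak is a run of adjacent slots with the same identifier and kind. The function name `collect_streaks`, the toy slot numbers, and the labels `PER` and `LOC` are hypothetical.

```python
def collect_streaks(slots, entity_id, entity_kind):
    """Group adjacent slots that share the same (eid, kind) into streaks.

    slots: iterable of slot numbers in ascending order.
    entity_id, entity_kind: dicts mapping slot -> value (hypothetical stand-ins
    for word features); slots without an entry are not part of any entity.
    """
    streaks = []
    current = None  # the streak being built: (eid, kind, [slots])
    for slot in slots:
        eid = entity_id.get(slot)
        kind = entity_kind.get(slot)
        if eid is None:
            # A word outside any entity ends the current streak.
            current = None
            continue
        extends = (
            current is not None
            and current[0] == eid
            and current[1] == kind
            and current[2][-1] == slot - 1
        )
        if extends:
            current[2].append(slot)
        else:
            current = (eid, kind, [slot])
            streaks.append(current)
    return streaks


# Toy data: slots 2-3 form one occurrence, slot 5 another, slot 6 a third.
entity_id = {2: "e1", 3: "e1", 5: "e2", 6: "e1"}
entity_kind = {2: "PER", 3: "PER", 5: "LOC", 6: "PER"}
streaks = collect_streaks(range(1, 8), entity_id, entity_kind)
print(streaks)
```

Each resulting streak would correspond to one `ent` node linked to the listed slots.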

Then we group all entity occurrences with the same `eid` and `kind` into `entity` nodes.

These entity nodes also have the features `eid` and `kind`, and there is an edge feature `eoccs`
from `entity` nodes to the `ent` nodes of their occurrences.
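
The grouping step can be sketched in the same hedged way. The snippet below is an assumption about the general shape of the operation, not the actual implementation: it keys each group on the pair `(eid, kind)` and records which occurrence nodes belong to it, which is the information an edge feature such as `eoccs` would carry. The node numbers and kind labels are made up.

```python
from collections import defaultdict


def group_occurrences(occurrences):
    """Map each (eid, kind) pair to the list of its occurrence nodes.

    occurrences: dict mapping an occurrence node number to its (eid, kind).
    """
    entities = defaultdict(list)
    for occ_node, key in occurrences.items():
        entities[key].append(occ_node)
    return dict(entities)


# Hypothetical occurrence nodes with their (eid, kind) values.
occurrences = {
    101: ("e1", "PER"),
    102: ("e2", "LOC"),
    103: ("e1", "PER"),
    104: ("e1", "LOC"),
}
entities = group_occurrences(occurrences)
distinct_eids = {eid for (eid, _kind) in entities}
print(len(entities), "entities,", len(distinct_eids), "distinct eids")
```

Because the key is the pair rather than the `eid` alone, an identifier that occurs with two different kinds yields two entities. That would be one possible explanation for a run reporting more distinct entities than distinct eids.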

The new nodes and features are in version `1.0e`.

At the moment of writing, the entity features are present in the
[cltl/voc-missives](https://github.com/cltl/voc-missives)
repo, but only on the `dev` branch.
We have made a copy of these features in this repo, in `voc-missives/export/tf`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from addEntities import AddEntities

In [3]:
AE = AddEntities()

**Locating corpus resources ...**

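| Name | # of nodes | # slots / node | % coverage |
| --- | --- | --- | --- |
| volume | 14 | 426954.79 | 100 |
| letter | 607 | 9847.39 | 100 |
| page | 11215 | 532.98 | 100 |
| table | 491 | 137.91 | 1 |
| para | 34773 | 100.79 | 59 |
| remark | 24110 | 97.49 | 39 |
| head | 607 | 31.12 | 0 |
| note | 12476 | 16.88 | 4 |
| line | 526918 | 11.34 | 100 |
| row | 8350 | 8.1 | 1 |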


In [4]:
AE.run() 

17756 entity occurrences
17756 good streaks
    0 bad streaks
17756 entity nodes
 4470 distinct eids
 4659 distinct entities
  0.00s preparing and checking ...
  0.15s Feature overview: 42 for nodes; 4 for edges; 1 configs; 9 computed
   |     8.85s done
  8.85s delete features ...
   |     0.00s done (4 features)
   |      |   entityId, entityKind, omap@0.4-1.0, omap@0.8.1-1.0
   |   Delete types: word                : keep:   shift  nodes       1-5977367 to         1-5977367
   |   Delete types: cell                : keep:   shift  nodes 5977368-6009669 to   5977368-6009669
   |   Delete types: folio               : keep:   shift  nodes 6009670-6017568 to   6009670-6017568
   |   Delete types: head                : keep:   shift  nodes 6017569-6018175 to   6017569-6018175
   |   Delete types: letter              : keep:   shift  nodes 6018176-6018782 to   6018176-6018782
   |   Delete types: line                : keep:   shift  nodes 6018783-6545700 to   6018783-6545700
   |   Delete

**Locating corpus resources ...**

   |     1.41s T otype                from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |       17s T oslots               from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |     0.00s T title                from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |       10s T transo               from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |     7.06s T transr               from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |       18s T trans                from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |     8.36s T punco                from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |       15s T punc                 from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |     5.98s T puncr                from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |     0.64s T transn               from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |     0.54s T puncn                from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |     1.33s T n                    from ~/github/CLARIAH/wp6-missieven/tf/1.0e
   |      |     

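| Name | # of nodes | # slots / node | % coverage |
| --- | --- | --- | --- |
| volume | 14 | 426954.79 | 100 |
| letter | 607 | 9847.39 | 100 |
| page | 11215 | 532.98 | 100 |
| table | 491 | 137.91 | 1 |
| para | 34773 | 100.79 | 59 |
| remark | 24110 | 97.49 | 39 |
| head | 607 | 31.12 | 0 |
| note | 12476 | 16.88 | 4 |
| line | 526918 | 11.34 | 100 |
| row | 8350 | 8.1 | 1 |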


Use the annotator API

In [2]:
# A = AE.A

from tf.app import use

A = use("CLARIAH/wp6-missieven:clone", checkout="clone", version="1.0e", hoist=globals())

**Locating corpus resources ...**

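| Name | # of nodes | # slots / node | % coverage |
| --- | --- | --- | --- |
| volume | 14 | 426954.79 | 100 |
| letter | 607 | 9847.39 | 100 |
| page | 11215 | 532.98 | 100 |
| table | 491 | 137.91 | 1 |
| para | 34773 | 100.79 | 59 |
| remark | 24110 | 97.49 | 39 |
| head | 607 | 31.12 | 0 |
| note | 12476 | 16.88 | 4 |
| line | 526918 | 11.34 | 100 |
| row | 8350 | 8.1 | 1 |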


# Make data for a new release

We create a new release with the entities incorporated.

In [6]:
from tf.app import use

In [7]:
A = use("CLARIAH/wp6-missieven:clone", checkout="clone")

**Locating corpus resources ...**

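| Name | # of nodes | # slots / node | % coverage |
| --- | --- | --- | --- |
| volume | 14 | 426954.79 | 100 |
| letter | 607 | 9847.39 | 100 |
| page | 11215 | 532.98 | 100 |
| table | 491 | 137.91 | 1 |
| para | 34773 | 100.79 | 59 |
| remark | 24110 | 97.49 | 39 |
| head | 607 | 31.12 | 0 |
| note | 12476 | 16.88 | 4 |
| line | 526918 | 11.34 | 100 |
| row | 8350 | 8.1 | 1 |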


In [8]:
A.zipAll()

Data to be zipped:
	OK       app                      (v1.1 7c1946)       : ~/github/CLARIAH/wp6-missieven/app
	OK       main data                (v1.1 7c1946)       : ~/github/CLARIAH/wp6-missieven/tf/1.0e
	OK       extra                    (v1.1 7c1946)       : ~/github/CLARIAH/wp6-missieven/ner
Writing zip file ...


'~/Downloads/github/CLARIAH/wp6-missieven/complete.zip'

In [9]:
!tf-zip --help

### USAGE

``` sh
tf-zip --help

tf-zip {org}/{repo}{relative}

tf-zip {org}/{repo}{relative} --backend=gitlab.huc.knaw.nl
```

### EFFECT

Zips TF data from your local GitHub / GitLab repository into
a release file, ready to be attached to a GitHub release.

Your repo must sit in `~/github/*org*/*repo*` or in `~/gitlab/*org*/*repo*`
or in whatever GitLab back-end you have chosen.

Your TF data is assumed to sit in the toplevel TF directory of your repo.
But if it is somewhere else, you can pass relative, e.g phrases/heads/tf

It is assumed that your TF directory contains subdirectories according to
the versions of the main data source.
The actual `.tf` files are in those version directories.

Each of these version directories will be zipped into a separate file.

The resulting zip files end up in `~/Downloads/backend/org-release/repo`
and the are named `relative-version.zip`
(where the / in relative have been replaced by -)



In [10]:
!tf-zip CLARIAH/wp6-missieven/tf

This is a TF dataset
Create release data for CLARIAH/wp6-missieven/tf
Found 10 versions
zip files end up in ~/Downloads/github/CLARIAH-release/wp6-missieven
zipping CLARIAH/wp6-missieven      0.4 with  35 features ==> tf-0.4.zip
zipping CLARIAH/wp6-missieven      0.5 with  35 features ==> tf-0.5.zip
zipping CLARIAH/wp6-missieven      0.6 with  35 features ==> tf-0.6.zip
zipping CLARIAH/wp6-missieven      0.7 with  41 features ==> tf-0.7.zip
zipping CLARIAH/wp6-missieven      0.8 with  40 features ==> tf-0.8.zip
zipping CLARIAH/wp6-missieven     0.8.1 with  40 features ==> tf-0.8.1.zip
zipping CLARIAH/wp6-missieven      0.9 with  40 features ==> tf-0.9.zip
zipping CLARIAH/wp6-missieven     0.9.1 with  41 features ==> tf-0.9.1.zip
zipping CLARIAH/wp6-missieven      1.0 with  47 features ==> tf-1.0.zip
zipping CLARIAH/wp6-missieven     1.0e with  46 features ==> tf-1.0e.zip
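
To make the naming scheme from the help text concrete, here is a small standalone sketch of zipping each version directory into its own `relative-version.zip` file. It does not call Text-Fabric and is not how `tf-zip` is implemented. The function `zip_versions`, the temporary directories, and the flat layout inside each archive are assumptions made for illustration only.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_versions(tf_dir, out_dir, relative="tf"):
    """Zip every version subdirectory of tf_dir into its own archive.

    Archives are named '<relative>-<version>.zip', with '/' in relative
    replaced by '-'. Only *.tf files are included, stored flat
    (the internal archive layout is an assumption of this sketch).
    """
    tf_dir = Path(tf_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = relative.replace("/", "-")
    created = []
    for version_dir in sorted(p for p in tf_dir.iterdir() if p.is_dir()):
        tf_files = sorted(version_dir.glob("*.tf"))
        if not tf_files:
            continue  # skip directories without feature files
        zip_path = out_dir / f"{prefix}-{version_dir.name}.zip"
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for tf_file in tf_files:
                zf.write(tf_file, arcname=tf_file.name)
        created.append(zip_path.name)
    return created


# Demonstration on a throwaway directory with two fake versions.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for version in ("1.0", "1.0e"):
        version_dir = root / "tf" / version
        version_dir.mkdir(parents=True)
        (version_dir / "otype.tf").write_text("@node\n")
    created = zip_versions(root / "tf", root / "out")
    print(created)
```

The per-version naming matches the `tf-1.0.zip` and `tf-1.0e.zip` files listed in the output above.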
