# Using Heritage Connector's ThesaurusMatcher with a Historical Gazetteer

Kalyan Dutia (kalyan.dutia at sciencemuseum.ac.uk)

In [1]:
from hc_nlp.pipeline import ThesaurusMatcher, EntityFilter, MapEntityTypes
from hc_nlp.spacy_helpers import display_ner_annotations

import spacy
from spacy import displacy
import pandas as pd

from IPython.display import display, Markdown

pd.set_option("display.max_colwidth", None)

## 1. Loading a spaCy model

Heritage Connector components extend the [spaCy](https://spacy.io/) NLP library where possible. Here we'll use the **small** **English** model `en_core_web_sm`; if you want a larger model, or a model in a language other than English, choose another from the page linked below.

- [spaCy models and languages page](https://spacy.io/usage/models)

**Replace `en_core_web_sm` with a model of your choice in the following two cells. Note that if you use a non-English language, you'll want to change the dataset and gazetteer too.**

First we download the model.

In [2]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


You may need to restart the kernel after this (`Kernel -> Restart Kernel...` from the top menu) and then rerun the cells above. You don't have to rerun the download cell.

We then load the model using `spacy.load`. You can see the component parts of the model in the output of the cell below.

In [2]:
# load the model
nlp = spacy.load('en_core_web_sm')

nlp.pipe_names

['tagger', 'parser', 'ner']

We can visualise them as part of an NLP pipeline that converts a piece of text into a [`Doc` object](https://spacy.io/api/doc) with labelled entities as below.

<img src="img/stock_pipeline.png" style="max-width: 70%">

## 2. Loading data

Here we'll load two files:

- some text records from `BL/BL_Medieval_manuscripts.csv`. These contain text in the *Scope & content* field and some places in that text that have been recognised through manual annotation in the *Related places* field.
- a gazetteer created using data from the BL and Historic England


### 2.1 Text records

In [3]:
data = pd.read_csv("./BL/BL_Medieval_manuscripts.csv")

print(f"{len(data)} records loaded ")
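
# create a list of text from a specific column for processing later
text = data['Scope & content'].tolist()

# show a reproducible sample of five records
# (this must be the last expression in the cell for Jupyter to display it)
data.sample(5, random_state=80)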

460 records loaded 


Unnamed: 0,Record ID (unique internal identifier),Scope & content,Related names,Related places
60,040-002048560,"This manuscript contains the Strategemata (Stratagems) by Sextus Julius Frontinus (b. c. 30, d. 104), a collection of examples of military stratagems from Greek and Roman history. It also contains Books 1-10 of the Historia Romana (History of Rome) by Paul the Deacon (b. c. 720, d. 799). This work incorporates the text of the Breviarium ab Urbe Condita (A Brief History of Rome since its Foundation) by the Roman historian Flavius Eutropius (fl. 370), and also the beginning of Book 11 of the Historia Romana.. This copy of the Frontinus's Strategemata belongs to the Anglo-Norman family, which also includes Cambridge, Peterhouse, MS 252.iii (11th-12th century, Northern France) and Oxford, Lincoln College, MS lat. 100 (written by William of Malmesbury in c. 1125), according to unpublished notes of Michael Gullick. According to the same notes, this copy of Eutropius's Breviarium belongs to the same family as Paris, Bibliothèque nationale de France, ms. lat. 7240 (2nd half of the 11th century, France) and ms. lat. 5802 (2nd or 3rd quarter of the 12th century, France), perhaps copied from the earlier manuscript. The present manuscript and Oxford, Lincoln College, MS lat.100, are close to Paris, Bibliothèque nationale de France, ms. lat. 7240. Contents:. ff. 1r-50v: Frontinus, Strategemata, Books 1-4, imperfect at the beginning and the end, beginning: 'cum hoc opus sicut caetera usus potius aliorum'. ff. 51r-54v: Eutropius, Breviarium ab Urbe Condita, Books 1-2, imperfect at the end, beginning: 'Domino Valenti Maximo perpetuo Augustus'. ff. 55v-109r: Paul the Deacon, Historia Romana, Books 1-11, imperfect at the end, beginning: 'Primus in Italia, ut quibusdam placet, regnavit Ianus', preceded by a rubric 'Incipit liber Eutropii historiographi de romana historia'. Decoration:. Numerous small initials in red or black ink, some with occasional foliate decoration. 
Rubrics in red.","Bentinck, Margaret Cavendish, duchess of Portland, née Harley, collector of art and natural history specimens and patron of arts and sciences, 11 Feb 1715-17 Jul 1785, http://isni.org/isni/0000 0001 1585 7160, http://viaf.org/viaf/0000 0001 1585 7160 ; Burscough, Robert, Church of England clergyman, 1650/51-1709 ; Durham Cathedral Priory, 1083-1539 ; Eutropius, Flavius, fl 370, http://isni.org/isni/25396473, http://viaf.org/viaf/25396473 ; Harley, Edward, second earl of Oxford and Mortimer, book collector and patron of the arts, 2 Jun 1689-16 Jun 1741, http://isni.org/isni/0000 0001 0807 8249, http://viaf.org/viaf/0000 0001 0807 8249 ; Harley, Henrietta Cavendish, Countess of Oxford and Mortimer, née Holles, patron of architecture, 4 Feb 1694-9 Dec 1755, http://isni.org/isni/n88140850, http://viaf.org/viaf/n88140850 ; Harley, Robert, first Earl of Oxford and Mortimer, politician, 5 Dec 1661-21 May 1724, http://isni.org/isni/N11994,R224660 ; Julius Frontinus, Sextus, c 30-104, http://isni.org/isni/12349897, http://viaf.org/viaf/12349897 ; Knott, Samuel, Rector of Combe Raleigh Devon, 1661-1668, d 1687 ; Paul the Deacon, c 720-799, http://isni.org/isni/40174477, http://viaf.org/viaf/40174477","Durham, England"
360,036-002190313,"Frodesley, Shropshire: Court-rolls: 1404-1460.",,"Frodesley, Shropshire"
129,040-002125661,"Shotton, Sedgefield, Durham: Deed relating to: 1335. Sedgefield, Durham: Deeds relating to: 1315-1335.",,"Sedgefield, Durham ; Shotton, Durham"
304,040-002165853,"Stoke-upon-Tern, Shropshire: Deed rel. to Waranshall in: 1448.",,"Stoke-upon-Tern, Shropshire"
89,040-002125429,"Much Marcle, Herefordshire: Grants of lands in, etc.: temp. Hen.III.-1398.",,"Much Marcle, Herefordshire"
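
The *Related places* field shown above packs several manually annotated places into one string, separated by ` ; `. If you later want to check pipeline output against these annotations, a small helper along the following lines could split the field up. This is only a sketch: the function names are made up for illustration, and treating the text before the first comma as the place name is an assumption based on the sample rows.

```python
from typing import List, Set


def parse_related_places(field) -> List[str]:
    """Split a 'Related places' cell into its individual entries."""
    # pandas reads empty cells as NaN (a float), so anything that isn't a string yields no places
    if not isinstance(field, str):
        return []
    return [entry.strip() for entry in field.split(";") if entry.strip()]


def place_names(field) -> List[str]:
    """Keep only the first part of each entry, e.g. 'Shotton, Durham' -> 'Shotton'."""
    # assumption: the text before the first comma is the place name itself
    return [entry.split(",")[0].strip() for entry in parse_related_places(field)]


def missed_places(field, predicted: Set[str]) -> List[str]:
    """Return annotated place names that are absent from a set of predicted entity strings."""
    predicted_lower = {p.lower() for p in predicted}
    return [name for name in place_names(field) if name.lower() not in predicted_lower]


names = place_names("Sedgefield, Durham ; Shotton, Durham")
# 'predicted' here is a hypothetical set of entity strings, not real pipeline output
missed = missed_places("Sedgefield, Durham ; Shotton, Durham", {"Sedgefield", "Durham"})
print(names, missed)
```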


### 2.2 Gazetteer - Creating a ThesaurusMatcher Component

The gazetteer we're using here has been put together from several CSV files using the notebook 'preprocess data.ipynb'.

Gazetteers for the `ThesaurusMatcher` should be in `.jsonl` format, with one JSON object per line as shown below. The optional `id` value allows you to link entity mentions back to specific IDs in a database. We're not using it here.

```json
// sample gazetteer.jsonl
{"label": "GPE", "pattern": "Laceby", "id": "optional_id"}
{"label": "GPE", "pattern": "Denby"}
{"label": "GPE", "pattern": "Hauxwell"}
...
```
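
If you are starting from a plain list of place names, a file in this format could be written with something like the sketch below. The function name, the sample names and the temporary output path are illustrative assumptions; this is not the code used in 'preprocess data.ipynb'.

```python
import json
import os
import tempfile
from typing import Iterable


def write_gazetteer(place_names: Iterable[str], path: str, label: str = "GPE") -> int:
    """Write one {"label", "pattern"} JSON object per line and return the number of lines written."""
    seen = set()
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for name in place_names:
            name = name.strip()
            # skip blanks and exact duplicates so that each pattern appears once
            if not name or name in seen:
                continue
            seen.add(name)
            f.write(json.dumps({"label": label, "pattern": name}, ensure_ascii=False) + "\n")
            count += 1
    return count


# hypothetical input: in practice the names would come from your own source files
out_path = os.path.join(tempfile.mkdtemp(), "sample_gazetteer.jsonl")
n_written = write_gazetteer(["Laceby", "Denby", "Hauxwell", "Denby", " "], out_path)
print(n_written)
```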

We load a gazetteer by creating an instance of `ThesaurusMatcher` with the `thesaurus_path` parameter pointing to our gazetteer file. At this point we also decide whether matches against the gazetteer are case-sensitive; here we make them case-insensitive.

In [4]:
thesaurus_matcher = ThesaurusMatcher(nlp, thesaurus_path='./gazetteer.jsonl', case_sensitive=False)

2020-12-14 09:45:28,626 - hc_nlp.pipeline - INFO - Loading thesaurus from ./gazetteer.jsonl
2020-12-14 09:45:33,605 - hc_nlp.pipeline - INFO - 16032 term thesaurus imported in 4s


## 3. Adding the ThesaurusMatcher to the spaCy pipeline

Now we'll look at **how we can combine machine learning and thesaurus- (or gazetteer-) based approaches** to improve the performance of NER systems on more specialised text datasets, without labelling loads of data.

There are several different ways we can combine the spaCy `NER` and `ThesaurusMatcher` components to produce different results, based on their relative positions in the entity retrieval pipeline.

<img src="img/combination_options.png" style="max-width: 70%">

The figure above shows the configurations of the three options:

1. `ThesaurusMatcher` before `ner`. This means that the NER model adjusts its predictions based on the entities that have been found by matching the text with a gazetteer.
2. `ThesaurusMatcher` after `ner`. In this configuration, the role of the gazetteer is to fill in any entities that the NER algorithm has missed. We might have some luck with this configuration and historical place names, as the NER training data won't contain many of the phrases that are in our gazetteer.
3. `ThesaurusMatcher` after `ner`, where the `ThesaurusMatcher` is able to overwrite entities that have been annotated by `ner`. 
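
The difference between options 2 and 3 can be illustrated with a toy merge of entity spans in plain Python. This is a simplified sketch of the idea, not the library's implementation: the span tuples and function names are made up, the example spans are invented, and gazetteer matches are assumed not to overlap one another. Option 1 is not modelled, because its effect depends on how the statistical NER model reacts to pre-set entities.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token, label), end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if two token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def merge_after_ner(ner: List[Span], gazetteer: List[Span], overwrite: bool) -> List[Span]:
    """Combine NER spans with gazetteer matches that are applied after NER."""
    if overwrite:
        # option 3: gazetteer matches win, so drop any NER span they overlap
        kept = [s for s in ner if not any(overlaps(s, g) for g in gazetteer)]
        return sorted(kept + gazetteer)
    # option 2: NER spans win, so add only gazetteer matches that touch no NER span
    added = [g for g in gazetteer if not any(overlaps(g, s) for s in ner)]
    return sorted(ner + added)


# invented spans over the tokens: Edric(0) de(1) Novo(2) Castello(3) of(4) Durham(5)
ner_spans = [(0, 4, "ORG")]
gazetteer_spans = [(2, 4, "GPE"), (5, 6, "GPE")]

fill_in = merge_after_ner(ner_spans, gazetteer_spans, overwrite=False)
overwritten = merge_after_ner(ner_spans, gazetteer_spans, overwrite=True)
print(fill_in)
print(overwritten)
```

In the fill-in case the NER span survives and only the non-overlapping gazetteer match is added; with overwriting, the longer NER span is lost in favour of the shorter gazetteer match.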

We add the `ThesaurusMatcher` component to the pipeline using `nlp.add_pipe()`, with arguments like `before=`, `after=`, `first=True` or `last=True` to specify its position. I've added the three pipelines below.

In [6]:
nlp_1 = spacy.load("en_core_web_sm")
nlp_1.add_pipe(thesaurus_matcher, before='ner')

nlp_2 = spacy.load("en_core_web_sm")
nlp_2.add_pipe(thesaurus_matcher, after='ner')

# for the third pipeline we create a new instance of ThesaurusMatcher with the extra argument overwrite_ents=True
nlp_3 = spacy.load("en_core_web_sm")
thesaurus_matcher_overwrite = ThesaurusMatcher(nlp, thesaurus_path='./gazetteer.jsonl', case_sensitive=False, overwrite_ents=True)
nlp_3.add_pipe(thesaurus_matcher_overwrite, after='ner')

# we also add an extra component which filters out some obvious false positives
entityfilter = EntityFilter(ent_labels_ignore=["DATE", "CARDINAL"])

for pipe in nlp, nlp_1, nlp_2, nlp_3:
    pipe.add_pipe(entityfilter, last=True)

nlp.pipe_names, nlp_1.pipe_names, nlp_2.pipe_names, nlp_3.pipe_names

2020-12-13 15:49:16,306 - hc_nlp.pipeline - INFO - Loading thesaurus from ./gazetteer.jsonl
2020-12-13 15:49:18,793 - hc_nlp.pipeline - INFO - 16032 term thesaurus imported in 2s


(['tagger', 'parser', 'ner', 'EntityFilter'],
 ['tagger', 'parser', 'ThesaurusMatcher', 'ner', 'EntityFilter'],
 ['tagger', 'parser', 'ner', 'ThesaurusMatcher', 'EntityFilter'],
 ['tagger', 'parser', 'ner', 'ThesaurusMatcher', 'EntityFilter'])

### 3.1 Comparing model variants

The cell below renders the entity annotations produced by the pure NER model and by each of the three combined pipelines, for three different texts.

**To render different texts, change the numbers in `enumerate(text[4:7])`**

In [7]:
for idx, item in enumerate(text[4:7]):
    display(Markdown(f"### -- ITEM {idx+1} --"))
    display(Markdown("#### NER:"))
    displacy.render(nlp(item), style='ent')
    
    display(Markdown("#### 1. ThesaurusMatcher before NER:"))
    displacy.render(nlp_1(item), style='ent')
    
    display(Markdown("#### 2. ThesaurusMatcher after NER:"))
    displacy.render(nlp_2(item), style='ent')
    
    display(Markdown("#### 3. ThesaurusMatcher after NER with overwrite:"))
    displacy.render(nlp_3(item), style='ent')

*[Rendered displaCy output not preserved: for each of items 1 to 3, the entity annotations from the pure NER model and from pipelines 1 (ThesaurusMatcher before NER), 2 (ThesaurusMatcher after NER) and 3 (ThesaurusMatcher after NER with overwrite).]*

## 4. Analysis

Use of the gazetteer has managed to clear up some confusion in the model, where it marked entities as PERSON or ORG that were actually places (GPE)<sup>1</sup>.

However, there are still some problems with tagging of the above dataset:
- *de Stanley* in the first example has been tagged correctly as a person by the NER model, but the gazetteer has tagged *Stanley* as a place because it exists in the gazetteer
- *Edric de Novo Castello* (PERSON) in example 3 has been tagged as ORG by the NER model, and *Novo Castello* has been tagged as a place by the gazetteer, splitting this entity in two.

**In this section we'll look at some approaches that could be used to improve these predictions.**

<sup>1</sup>This is a common flaw in NER models largely due to the fact that people and organisations are often named after places!

### 4.1 Ensuring the gazetteer contains words that are unambiguously places

This could involve e.g. writing a script which looks through a list of known people and removes any items from the gazetteer that also exist in the people list.
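
A minimal sketch of such a script is below. It assumes the gazetteer lines are in the `.jsonl` format described earlier and that you have some list of known people to compare against. The function name and the people list are made up for illustration. Note one design choice: it compares gazetteer patterns against the individual words of each person's name, so that *Stanley* is caught by a person called *John de Stanley*.

```python
import json
from typing import Iterable, List, Set


def remove_ambiguous(gazetteer_lines: Iterable[str], known_people: Iterable[str]) -> List[str]:
    """Drop gazetteer entries whose pattern also appears as a word in a known person's name."""
    person_tokens: Set[str] = set()
    for person in known_people:
        # lower-case each word and strip simple punctuation so that 'Stanley,' matches 'Stanley'
        person_tokens.update(tok.strip(".,").lower() for tok in person.split())

    kept = []
    for line in gazetteer_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry["pattern"].lower() not in person_tokens:
            kept.append(line)
    return kept


gazetteer = [
    '{"label": "GPE", "pattern": "Stanley"}',
    '{"label": "GPE", "pattern": "Durham"}',
    '{"label": "GPE", "pattern": "Laceby"}',
]
# hypothetical people list, for illustration only
people = ["William de Kilkenny", "John de Stanley"]

filtered = remove_ambiguous(gazetteer, people)
print(filtered)
```

The trade-off is that genuine mentions of a removed place will no longer be matched by the gazetteer, so it is worth reviewing what gets dropped.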

### 4.2 Identifying *'de ....'* names using a rule-based approach

There's (at least) one pattern in this dataset which could benefit from rule-based annotation of entities rather than relying on a gazetteer or machine learning models.

We can use rule-based matching to identify the pattern 'de **Name**', which is written as a list with one dictionary per token:

``` python
pattern = [{'ORTH': 'de'},
           {'IS_TITLE': True}]
```

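where `'ORTH'` means an exact text match and `'IS_TITLE'` matches a token written in title case (first letter capitalised, the rest lower case).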
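
To get a feel for what this pattern picks up without loading a spaCy model, here is a rough plain-Python approximation. It is not how spaCy's matcher works internally: the regex tokeniser is a simplification of spaCy's, and the function name is made up. The sample text is taken from the matcher explorer example linked in this section.

```python
import re
from typing import List


def find_de_names(text: str) -> List[str]:
    """Find 'de' followed by a title-case token, mimicking the two-token pattern."""
    # crude tokeniser: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    matches = []
    for first, second in zip(tokens, tokens[1:]):
        # 'ORTH': 'de' -> exact, case-sensitive text; 'IS_TITLE': True -> str.istitle()
        if first == "de" and second.istitle():
            matches.append(f"{first} {second}")
    return matches


sample = (
    "Notification and confirmation by W. de Mansfeld, official of the archdeacon "
    "of Durham, to Master William de Kilkenny, of the receipt of letters from "
    "William de Lanum archdeacon of Durham"
)
de_names = find_de_names(sample)
print(de_names)
```

Note that the pattern only covers *de* and the single token after it, so forenames and multi-word surnames such as *de Novo Castello* would need a longer pattern.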

Try [spaCy's rule-based matcher explorer](https://explosion.ai/demos/matcher?text=Notification%20and%20confirmation%20by%20W.%20de%20Mansfeld%2C%20official%20of%20the%20archdeacon%20of%20Durham%2C%20to%20Master%20William%20de%20Kilkenny%2C%20of%20the%20receipt%20of%20letters%20from%20William%20de%20Lanum%20archdeacon%20of%20Durham%2C%20reciting%20others%20from%20Nicholas%20de%20Farnham%20bishop%20elect%20of%20Durham%2C%20whereby%20permission%20is%20granted%20to%20the%20said%20Master%20William%20to%20build%20an%20oratory%20on%20his%20lands%20at%20Stanley%2C%20for%20the%20use%20of%20himself%20and%20his%20family%2C%20because%20of%20the%20distance%20from%20the%20mother%20church.%20Dated%20as%3A%2014%20April%2C%201241%20.%20Endorsements%3A%20...%20de%20Stanley%20...Chaplain%20.....%20et%20....%20Kelkenny%20.&model=en_core_web_sm&pattern=%5B%7B%22id%22%3A0%2C%22attrs%22%3A%5B%7B%22name%22%3A%22ORTH%22%2C%22value%22%3A%22de%22%7D%5D%7D%2C%7B%22id%22%3A3%2C%22attrs%22%3A%5B%7B%22name%22%3A%22IS_TITLE%22%2C%22value%22%3Atrue%7D%5D%7D%5D) to generate rules for entity matching and experiment with them on different pieces of text.

You could then add the Heritage Connector `PatternMatcher` component to your pipeline as per the code sample and diagram below:

``` python
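# assumption: PatternMatcher is importable from hc_nlp.pipeline, like ThesaurusMatcher above
from hc_nlp.pipeline import PatternMatcher

patterns = [
    [{'ORTH': 'de'}, {'IS_TITLE': True}],
    # add further token patterns here, one list per pattern
]
pattern_matcher = PatternMatcher(nlp, patterns)

# use the same variable name as above when adding the component to the pipeline
nlp.add_pipe(pattern_matcher, before='ner')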
```

<img src="img/patternmatcher_pipeline.png" style="max-width: 70%">


## 5. Further Reading

- the `ThesaurusMatcher` and `PatternMatcher` components are wrappers around spaCy's `EntityRuler`, designed to be easier to use for those not familiar with spaCy: https://spacy.io/api/entityruler
- after building a pipeline using a combination of rules and machine learning, you might want to use this new model to label some data, then use these new labels to train a new model. A tutorial for training spaCy models is here: https://spacy.io/usage/training
- for applying a pipeline to large amounts of data have a look at the `nlp.pipe` method ([link](https://spacy.io/api/language#pipe)) which contains optimisations for large datasets such as batch and parallel processing
- a description of why spaCy's gazetteer matching is faster than writing a simple dictionary lookup is here: https://explosion.ai/blog/spacy-v2-2#faster-phrasematcher

More information about what we're doing on the Heritage Connector project is available on the [project website](https://www.sciencemuseumgroup.org.uk/project/heritage-connector/) or [project blog](https://thesciencemuseum.github.io/heritageconnector/).