# Adding Thesaurus Lookup to NER

Here we use the `ThesaurusMatcher` and `EntityFilter` pipeline components to find entities that are listed in a thesaurus, and then to filter out any matches that, according to a set of rules we've defined, don't look like entities.
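To make the two stages concrete, here is a minimal, library-free sketch of the idea: a case-insensitive lookup of thesaurus terms in a list of tokens, followed by a rule-based filter. This is not how `hc_nlp` implements these components. The function names, the toy thesaurus with its placeholder identifiers, and the single filter rule (drop matches made up only of very short words) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy thesaurus: lower-cased term -> (label, placeholder identifier).
TOY_THESAURUS: Dict[str, Tuple[str, str]] = {
    "laennec stethoscope": ("WORK_OF_ART", "object-1"),
    "dr": ("WORK_OF_ART", "object-2"),
}

Match = Tuple[str, str, str]  # (matched text, label, identifier)


def find_thesaurus_matches(
    tokens: List[str],
    thesaurus: Dict[str, Tuple[str, str]],
    max_words: int = 3,
) -> List[Match]:
    """Greedy longest-match, case-insensitive lookup of thesaurus terms."""
    matches: List[Match] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            entry = thesaurus.get(candidate.lower())
            if entry is not None:
                matches.append((candidate, entry[0], entry[1]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no term starts at this token
    return matches


def filter_matches(matches: List[Match], min_chars: int = 3) -> List[Match]:
    """Keep a match only if at least one of its words has `min_chars` or more characters."""
    return [m for m in matches if any(len(w) >= min_chars for w in m[0].split())]


tokens = "Laennec stethoscope presented to Dr Bégin".split()
found = find_thesaurus_matches(tokens, TOY_THESAURUS)
kept = filter_matches(found)
```

In this toy run the lookup finds both "Laennec stethoscope" and "Dr", and the illustrative filter rule then discards "Dr". The rule and threshold here are stand-ins and should not be read as the rules that `EntityFilter` applies.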

In [1]:
import sys
# Add the parent directory to the path so that the hc_nlp package can be imported
sys.path.append("..")

import spacy

from hc_nlp.pipeline import ThesaurusMatcher, EntityFilter
from hc_nlp.spacy_helpers import display_ner_annotations

In [2]:
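nlp = spacy.load("en_core_web_lg")
# Load the thesaurus and match its terms case-insensitively
thes = ThesaurusMatcher(nlp, thesaurus_path="../data/labels_all.jsonl", case_sensitive=False)
# Run the thesaurus matcher before spaCy's statistical NER component
nlp.add_pipe(thes, before='ner')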

nlp.pipe_names

Loading thesaurus from ../data/labels_all.jsonl
298071 term thesaurus imported in 146s


['tagger', 'parser', 'ThesaurusMatcher', 'ner']

In [3]:
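# Remove any existing EntityFilter so that this cell can be re-run safely
if 'EntityFilter' in nlp.pipe_names:
    nlp.remove_pipe('EntityFilter')

# The EntityFilter could also be placed before the ner component,
# where it would filter only the thesaurus matches
entityfilter = EntityFilter(max_token_length=1)
nlp.add_pipe(entityfilter, last=True)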

nlp.pipe_names

['tagger', 'parser', 'ThesaurusMatcher', 'ner', 'EntityFilter']

In [4]:
doc = nlp("LAENNEC STETHOSCOPE Laennec stethoscope made by Laennec, c.1820. In 1816, the French doctor René Laennec listened to a young woman's heart through a tube of rolled-up paper to avoid the embarrassment and impropriety of putting his ear to her chest. He called his invention the stethoscope (from the Greek word for chest, stethos), and went on to make wooden versions like this early example. The famous binaural stethoscope came into use in the 1840s. The stethoscope is labelled as follows: 'This is one of Laennec's original stethoscopes, and it was presented by him to Dr Bégin a French Army surgeon whose widow gave it to me in 1863.''")
display_ner_annotations(doc)

In [5]:
# Entities found by thesaurus pattern matching (these carry an ent_id_)
[(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents if ent.ent_id_]

[('Laennec stethoscope',
  'WORK_OF_ART',
  'https://collection.sciencemuseumgroup.org.uk/objects/co91292'),
 ('Greek',
  'ORG',
  'https://collection.sciencemuseumgroup.org.uk/people/cp99254'),
 ('Dr',
  'WORK_OF_ART',
  'https://collection.sciencemuseumgroup.org.uk/objects/co135433')]

In [6]:
# Entities predicted by the NER model (these have an empty ent_id_)
[(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents if not ent.ent_id_]

[('Laennec', 'PERSON', ''),
 ('1816', 'DATE', ''),
 ('French', 'NORP', ''),
 ('René Laennec', 'PERSON', ''),
 ('the 1840s', 'DATE', ''),
 ('Laennec', 'ORG', ''),
 ('Bégin', 'PERSON', ''),
 ('French Army', 'ORG', ''),
 ('1863', 'DATE', '')]
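
The two cells above separate `doc.ents` by whether `ent_id_` is set. A small helper could do both splits in one pass. The sketch below is one way to write it: the function name is hypothetical, it relies only on the `text`, `label_` and `ent_id_` attributes used above, and it is demonstrated on stand-in objects (with values copied from the outputs above) so that it runs without loading the pipeline.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple

EntTuple = Tuple[str, str, str]  # (text, label, ent_id)


def split_entities_by_source(ents: Iterable) -> Tuple[List[EntTuple], List[EntTuple]]:
    """Return (thesaurus_matches, ner_predictions) from span-like objects.

    An entity is treated as a thesaurus match when its ent_id_ is non-empty,
    mirroring the condition used in the list comprehensions above.
    """
    from_thesaurus: List[EntTuple] = []
    from_ner: List[EntTuple] = []
    for ent in ents:
        row = (ent.text, ent.label_, ent.ent_id_)
        if ent.ent_id_:
            from_thesaurus.append(row)
        else:
            from_ner.append(row)
    return from_thesaurus, from_ner


# Stand-in objects for illustration; with the real pipeline, pass doc.ents instead.
example_ents = [
    SimpleNamespace(
        text="Greek",
        label_="ORG",
        ent_id_="https://collection.sciencemuseumgroup.org.uk/people/cp99254",
    ),
    SimpleNamespace(text="René Laennec", label_="PERSON", ent_id_=""),
    SimpleNamespace(text="1863", label_="DATE", ent_id_=""),
]
thesaurus_ents, ner_ents = split_entities_by_source(example_ents)
```

With the notebook's pipeline loaded, calling `split_entities_by_source(doc.ents)` should give the same two lists as the cells above.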