In [1]:
import spacy
from spacy.language import Language

In [11]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Britain is a place. Mary is a Doctor.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Britain GPE
Mary PERSON


Like the off-the-shelf entity_ruler and sentencizer components, we can create our own custom component.
Let's say we want to change all GPEs to LOC, or remove all GPEs. In such cases the off-the-shelf components are of little use. Instead, we can write a custom component with the <b>@Language.component</b> decorator in spaCy and add it to the spaCy pipeline.

In [12]:
@Language.component("remove_gpe")
def remove_gpe(doc):
    original_ents = list(doc.ents)
    for ent in doc.ents:
        if ent.label_ == "GPE":
            original_ents.remove(ent)
    doc.ents = original_ents
    return doc

In [14]:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("remove_gpe")

<function __main__.remove_gpe(doc)>

In [8]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'remove_gpe': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  

In [15]:
doc = nlp("Britain is a place. Mary is a Doctor.")
for ent in doc.ents:
    print(ent.text, ent.label_)  

Mary PERSON


In [16]:
#nlp.to_disk("data/new_en_core_web_sm") # to save this updated model to the disk
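
The other case mentioned earlier, changing every GPE to LOC, can be handled the same way. Below is a minimal sketch rather than a definitive implementation: the component name gpe_to_loc is our own invention, and the entities are set by hand on a blank pipeline so that the example runs without a trained model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span


@Language.component("gpe_to_loc")  # hypothetical name, chosen for this sketch
def gpe_to_loc(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "GPE":
            # Build a new span over the same tokens with a different label
            ent = Span(doc, ent.start, ent.end, label="LOC")
        new_ents.append(ent)
    doc.ents = new_ents
    return doc


nlp = spacy.blank("en")
doc = nlp("Britain is a place. Mary is a Doctor.")
# Assumption: entities are assigned manually here instead of by an "ner" pipe
doc.ents = [Span(doc, 0, 1, label="GPE"), Span(doc, 5, 6, label="PERSON")]

doc = gpe_to_loc(doc)
labels = [(ent.text, ent.label_) for ent in doc.ents]
print(labels)  # [('Britain', 'LOC'), ('Mary', 'PERSON')]
```

With a trained model, the same component could be added with nlp.add_pipe("gpe_to_loc") so that it runs after the "ner" pipe.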

### RegEx in spaCy

In [22]:
#usual regex to find patterns
import re

text = "This is a date 2 February. Another date would be 14 August."

pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"

matches = re.findall(pattern, text)
print(matches)

[('2 February', '2', 'February'), ('14 August', '4', 'August')]


In this bit of code, we see a real-life RegEx formula at work. While this looks quite complex, its syntax is fairly straightforward. Let’s break it down. The first ( and the final ) form a group, which tells RegEx to capture everything between them. In other words, this outermost group captures the whole match, not just its components.

Next, we state (\d){1,2}. This means that we are looking for any digit (0-9) that occurs either once or twice ({1,2}).

Next, we have a space to indicate the space in the string that we would expect with a date.

Next, we have (January|February|March|April|May|June|July|August|September|October|November|December). This is another component of the pattern (because it is in parentheses). The | means the same as “or” in English, so it matches either January, or February, or March, etc.
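
One detail in the output above is worth a closer look: the second item for “14 August” is '4', not '14'. This is because (\d){1,2} repeats a group that holds a single digit, and a repeated group only remembers its last repetition. A small sketch using only the standard re module (the shortened patterns here are for illustration only):

```python
import re

text = "Another date would be 14 August."

# (\d){1,2}: the group wraps one digit and is repeated,
# so the group keeps only the last digit it matched.
repeated = re.search(r"(\d){1,2} (August)", text)
print(repeated.group(0))  # 14 August
print(repeated.group(1))  # 4

# (\d{1,2}): the repetition sits inside the group,
# so the group holds both digits.
whole = re.search(r"(\d{1,2}) (August)", text)
print(whole.group(1))  # 14
```

The full match is the same in both cases; only the content of the captured group differs.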

When we bring it together, this pattern will match anything that functions as a set of one or two numbers followed by a month. What happens when we try and do this with a date that is formed the opposite way?

In [23]:
text = "This is a date February 2. Another date would be 14 August."
matches = re.findall(pattern, text)
print (matches)

[('14 August', '4', 'August')]


It fails. But this is no fault of RegEx. Our pattern cannot accommodate that variation. Nevertheless, we can account for it by adding it as a possible variation. Possible variations are separated with a | (the same “or” we used for the months).

In [26]:
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"

text = "This is a date February 2. Another date would be 14 August."
matches = re.findall(pattern, text)
print(matches)

[('February 2', '', '', '', '', 'February 2', 'February ', 'February', '2'), ('14 August', '14 August', '4', ' August', 'August', '', '', '', '')]


There are more concise ways to write the same RegEx formula. I have opted here to be more verbose to make it a bit easier to read. You can see that we’ve allowed for two main options for our pattern matcher.

Notice, however, that we have a lot of superfluous information for each match. These are the components of each match. There are several ways we can remove them. One way is to use the command finditer, rather than findall in RegEx.
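
Another of those ways, sketched here as an assumption rather than the approach this notebook follows, is to make the groups non-capturing with (?:...). When a pattern has no capturing groups, findall returns the whole matches as plain strings. The helper names months and day are ours, introduced only for this example.

```python
import re

# (?:...) groups the alternatives without capturing them
months = (
    r"(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)"
)
day = r"\d{1,2}"

# Either "day month" or "month day"
pattern = rf"{day} {months}|{months} {day}"

text = "This is a date February 2. Another date would be 14 August."
matches = re.findall(pattern, text)
print(matches)  # ['February 2', '14 August']
```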

In [28]:
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"

text = "This is a date February 2. Another date would be 14 August."
matches = re.finditer(pattern, text) 
print(matches) # returns iterable

for match in matches:
    print(match)

<callable_iterator object at 0x000002BC55F60F70>
<re.Match object; span=(15, 25), match='February 2'>
<re.Match object; span=(49, 58), match='14 August'>


Within each of these is some very salient information, such as the start and end location (inside the span) and the text itself (match). We can use the start and end location to grab the text within the string.
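
As a small sketch of that idea, the start and end of each match can be used to slice the original string. The pattern is shortened to two months here purely to keep the example compact.

```python
import re

text = "This is a date February 2. Another date would be 14 August."
# Shortened pattern (an assumption for brevity): only the two months in the text
pattern = r"\d{1,2} (?:February|August)|(?:February|August) \d{1,2}"

found = []
for match in re.finditer(pattern, text):
    start, end = match.span()  # character offsets into text
    found.append((start, end, text[start:end]))

print(found)  # [(15, 25, 'February 2'), (49, 58, '14 August')]
```

The sliced text is the same string that match.group(0) returns; the offsets are what we will need later to locate the match inside a spaCy doc.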

### How to Use RegEx in spaCy
Things like dates, times, IP addresses, etc. that have either consistent or fairly consistent structures are excellent candidates for RegEx. Fortunately, spaCy supports RegEx in the token patterns used by the Matcher and the EntityRuler. One of the major drawbacks to the Matcher (and the related PhraseMatcher) is that they do not store their matches in doc.ents. Because this notebook is about NER and our goal is to store the entities in doc.ents, we will focus on using RegEx with the EntityRuler. In the next notebook, we will examine other methods.
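
To make the drawback concrete, here is a minimal sketch, assuming a blank English pipeline and a rule name (FOUR_DIGITS) that we made up: the Matcher finds the token, but doc.ents stays empty.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "FOUR_DIGITS" is a hypothetical rule name; the RegEx is applied to each token's text
matcher.add("FOUR_DIGITS", [[{"TEXT": {"REGEX": r"^\d{4}$"}}]])

doc = nlp("This is a sample number 555-5555.")
matches = matcher(doc)  # list of (match_id, start, end) tuples
matched_text = [doc[start:end].text for _, start, end in matches]

print(matched_text)   # ['5555']
print(len(doc.ents))  # 0
```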


In [30]:
text = "This is a sample number 555-5555."
nlp = spacy.blank("en")

ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "PHONE_NUMBER", 
     "pattern": [
         {"SHAPE": "ddd"},
         {"ORTH": "-", "OP": "?"},
         {"SHAPE": "dddd"}
     ]
    }
]

ruler.add_patterns(patterns)
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

555-5555 PHONE_NUMBER
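
The pattern above relies on token shapes. As a quick sketch (again on a blank English pipeline), we can inspect the shape_ attribute of the tokens to see what the SHAPE keys are matched against:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is a sample number 555-5555.")

# Map each token's text to its shape; "d" stands for a digit
shape_of = {token.text: token.shape_ for token in doc}
print(shape_of["555"], shape_of["5555"])  # ddd dddd
```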


This method worked well for grabbing the phone number. But what if we wanted to use RegEx as opposed to linguistic features, such as shape? First, let’s write some RegEx to capture 555-5555.

In [33]:
import re
text = "This is a sample number 555-5555."
pattern = r"((\d){3}-(\d){4})"
matches = re.findall(pattern, text)
print(matches)

[('555-5555', '5', '5')]


Okay. So, now we know that we have a RegEx pattern that works. Let’s try and implement it in the spaCy EntityRuler. When we execute the code below, we get no output.

In [35]:
text = "This is a sample number 555-5555."
nlp = spacy.blank("en")

ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "PHONE_NUMBER", 
     "pattern": [
         {"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}  # raw string so \d is not treated as a string escape
     ]
    }
]

ruler.add_patterns(patterns)
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

This is for one very important reason. spaCy’s EntityRuler applies RegEx to one token at a time, so it cannot pattern match across tokens. The tokenizer splits 555-5555 at the dash into three tokens, so no single token contains the whole pattern. So, what are we to do in this scenario? Well, we have a few different options that we will explore next. But before we get to that, let’s try and use RegEx to capture the phone number with no hyphen.

In [38]:
text = "This is a sample number 5555555."
nlp = spacy.blank("en")

ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
    {"label": "PHONE_NUMBER", 
     "pattern": [
         {"TEXT": {"REGEX": r"((\d){7})"}}  # raw string; seven digits to cover the whole number
     ]
    }
]

ruler.add_patterns(patterns)
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

5555555 PHONE_NUMBER


Notice that without the dash and with a few modifications to our RegEx, we were able to capture 5555555 because it is a single token in the spaCy doc object. Let’s explore how to solve the multi-token problem in the spaCy EntityRuler!
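
The difference between the two cases comes down to tokenization. A short sketch, assuming the default tokenizer of a blank English pipeline, shows how each string is split:

```python
import spacy

nlp = spacy.blank("en")

hyphen_tokens = [t.text for t in nlp("This is a sample number 555-5555.")]
plain_tokens = [t.text for t in nlp("This is a sample number 5555555.")]

# The hyphenated number is spread over several tokens ...
print(hyphen_tokens)
# ... while the unhyphenated number is a single token
print(plain_tokens)
```

Because a REGEX key is tested against each token separately, only the second text offers a single token that can contain the whole number.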

### Problems with Multi-Word Tokens in spaCy as Entities
As we saw before in Rules-Based NER, we can use spaCy’s Matcher to grab multi-word tokens, that is, matches that span multiple tokens. The main problem with this, however, is that these multi-word tokens are not placed into doc.ents. This means that we cannot access them the same way we access other entities. Now, we will figure out how to solve that problem with a simple workflow:

- Extract Multi-Word Tokens with re.finditer()
- Reconstruct the spans in the spaCy doc
- Give priority to longer spans (Optional)
- Inject the Spans into doc.ents

We will cover each of these steps in turn.
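
As a preview, the four steps can be sketched in one small function. This is a rough outline under our own assumptions, not the final code of this notebook: the helper name add_regex_ents is hypothetical, and a blank English pipeline is used so that it runs without a trained model.

```python
import re

import spacy
from spacy.util import filter_spans


def add_regex_ents(doc, pattern, label):
    """Hypothetical helper: add RegEx matches in doc.text to doc.ents."""
    spans = list(doc.ents)
    # Step 1: extract character-level matches with re.finditer()
    for hit in re.finditer(pattern, doc.text):
        # Step 2: reconstruct a token-level span in the doc
        span = doc.char_span(*hit.span(), label=label)
        if span is not None:  # None if the match does not align with tokens
            spans.append(span)
    # Step 3 (optional): keep the longest span where spans overlap
    # Step 4: inject the spans into doc.ents
    doc.ents = filter_spans(spans)
    return doc


nlp = spacy.blank("en")
doc = add_regex_ents(nlp("Paul Newman was an American actor."), r"Paul [A-Z]\w+", "PERSON")
result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)  # [('Paul Newman', 'PERSON')]
```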

### Extract Multi-Word Tokens
First, we need to grab the multi-word tokens. In this case, a person whose first name is Paul. In the RegEx below, we specify that we are looking for any string that starts with “Paul” and is then followed by a space and a capital letter ([A-Z]). We then tell it to grab the rest of that second word (\w+).

In [39]:
import re
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

pattern = r"Paul [A-Z]\w+"
matches = re.finditer(pattern, text)
for hit in matches:
    print(hit)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>


Note that we have not grabbed the final “Paul” which is not followed by a last name. In this case, we are not interested in that Paul. Now that we know how to grab the multi-word tokens, we need to have a way to parse them in spaCy.

### Reconstruct Spans
This next stage is a bit more complicated, but works quite well once you understand the process. First, we need to import the libraries we will need. Note that we are also adding Span from spacy.tokens.

In [52]:
import re
from spacy.tokens import Span

# re returns character spans, but the spaCy doc needs token spans, so we convert them below

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
pattern = r"Paul [A-Z]\w+"
matches = re.finditer(pattern, text)

nlp = spacy.blank("en")
doc = nlp(text)
original_ents = list(doc.ents)
print(f"original_ents: {original_ents}")  # empty because a blank pipeline has no "ner" component
multi_word_ents = []
for hit in matches:
    start, end = hit.span()  # character offsets
    # Reconstruct the span; char_span returns None if the offsets do not align with token boundaries
    span = doc.char_span(start, end, label="PERSON")
    if span is not None:
        # span.start and span.end are token indices, not character offsets
        print(f"Token Span: {span, span.start, span.end, span.text}")
        multi_word_ents.append((span.start, span.end, span.text))
# Inject the spans into doc.ents
for mwt_ent in multi_word_ents:
    start, end, name = mwt_ent  # "name" avoids overwriting the text variable
    per_ent = Span(doc, start, end, label="PERSON")
    original_ents.append(per_ent)

doc.ents = original_ents  # doc.ents only accepts Span objects
for ent in doc.ents:
    print(ent.text, ent.label_)

original_ents: []
Token Span: (Paul Newman, 0, 2, 'Paul Newman')
Token Span: (Paul Hollywood, 8, 10, 'Paul Hollywood')
Paul Newman PERSON
Paul Hollywood PERSON
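
The check for None in the code above matters because doc.char_span only returns a span when the character offsets line up with token boundaries. A minimal sketch of what happens otherwise, using the alignment_mode parameter of char_span (the offsets here are chosen by us to cut through a token):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor.")

# Characters 0-8 cover "Paul New", which ends in the middle of a token
strict = doc.char_span(0, 8)  # default alignment_mode="strict"
expanded = doc.char_span(0, 8, alignment_mode="expand")

print(strict)         # None
print(expanded.text)  # Paul Newman
```

With "expand", the span grows to cover every token it touches; whether that is desirable depends on the pattern, so the notebook's code simply skips misaligned matches.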


### Creating a custom component with this

In [57]:
import re
import spacy
from spacy.tokens import Span
from spacy.language import Language

@Language.component("paul_ent")
def paul_ent(doc):
    original_ents = list(doc.ents)
    # Match "Paul" followed by a capitalized word
    pattern = r"Paul [A-Z]\w+"
    matches = re.finditer(pattern, doc.text)
    for hit in matches:
        start, end = hit.span()  # character offsets
        span = doc.char_span(start, end, label="PERSON")
        if span is not None:  # None means the match did not align with token boundaries
            original_ents.append(span)  # char_span already returns a labelled Span
    doc.ents = original_ents
    return doc
        

In [58]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
nlp = spacy.blank("en")
nlp.add_pipe("paul_ent")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
Paul Hollywood PERSON


### Give priority to Longer Spans
Sometimes, the situation is not so neat. Sometimes our custom RegEx entities will overlap with spaCy’s entities.

In [59]:
import re
import spacy

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
pattern = r"Hollywood"

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
for ent in doc.ents:
    print (ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP


Let’s say that we create a new entity. Maybe words associated with Cinema. So, we want to classify Hollywood as a tag “CINEMA”. Now, in the above text, Hollywood is clearly associated with Paul Hollywood, but let’s imagine for a moment that it is not. Let’s try and run the same code as above. If we do, we notice that we get an error.

In [61]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
pattern = r"Hollywood"

mwt_ents = []
original_ents = list(doc.ents)
for match in re.finditer(pattern, doc.text):
    print(match)
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="CINEMA")
    original_ents.append(per_ent)

doc.ents = original_ents

<re.Match object; span=(44, 53), match='Hollywood'>


ValueError: [E1010] Unable to set entity information for token 9 which is included in more than one span in entities, blocked, missing or outside.

This error tells us that one of our tokens from the finditer() overlapped with one that our “ner” component found. This is a problem that can be rectified with spaCy’s filter_spans. This gives priority to longer spans. Notice how we have allowed the Paul Hollywood entity to be a PERSON, rather than CINEMA. This is because Hollywood is shorter than Paul Hollywood. 

In [64]:
from spacy.util import filter_spans

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
pattern = r"Hollywood"

mwt_ents = []
original_ents = list(doc.ents)
for match in re.finditer(pattern, doc.text):
    print(match)
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="CINEMA")
    original_ents.append(per_ent)

# Filter once, after all spans are collected; where spans overlap, the longest one is kept
filtered = filter_spans(original_ents)
doc.ents = filtered
for ent in doc.ents:
    print(ent.text, ent.label_)

<re.Match object; span=(44, 53), match='Hollywood'>
Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
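
The same priority rule can be seen in isolation. In this sketch the sentence and the spans are our own assumptions, set by hand on a blank pipeline: the CINEMA span that sits inside the longer PERSON span is dropped, while a CINEMA span that overlaps nothing is kept.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("Paul Hollywood visited Hollywood.")

person = Span(doc, 0, 2, label="PERSON")         # "Paul Hollywood"
cinema_inside = Span(doc, 1, 2, label="CINEMA")  # overlaps the PERSON span
cinema_alone = Span(doc, 3, 4, label="CINEMA")   # overlaps nothing

# Without filter_spans, assigning the overlapping spans would raise an error
doc.ents = filter_spans([person, cinema_inside, cinema_alone])
kept = [(ent.text, ent.label_) for ent in doc.ents]
print(kept)  # [('Paul Hollywood', 'PERSON'), ('Hollywood', 'CINEMA')]
```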
